Cell-to-cell variability in gene expression is an inherent feature of biological systems. Single-cell RNA sequencing can be used to quantify this heterogeneity, but it is prone to strong technical noise. Here, we describe a computational workflow that uses, among others, the BASiCS Bioconductor package to robustly quantify expression variability within and between known cell populations (e.g. experimental conditions or cell types). Within a single homogeneous population of cells, BASiCS estimates interpretable gene- and cell-specific parameters that can be used for normalization and to identify highly and lowly variable genes. Furthermore, we also show how BASiCS can highlight changes in gene expression variability between cell populations, while avoiding confounding effects related to technical noise or changes in overall abundance. Using two publicly available datasets, we guide users through a complete pipeline which includes preliminary steps for quality control and data exploration using the scater and scran Bioconductor packages. Data for the first case study was generated using the Fluidigm@ C1 system, in which spike-in RNA molecules are added in order to quantify technical noise. The second dataset was generated using a droplet-based system, for which spike-in RNA is not available. The latter analysis provides an example, in which differential variability testing reveals insights regarding a possible early cell fate commitment process.
Single-cell RNA-sequencing (scRNA-Seq) is the leading technology to study genome-wide transcriptional heterogeneity in cell populations that remains otherwise undetected in bulk experiments (Stegle, Teichmann, and Marioni 2015; Prakadan, Shalek, and Weitz 2017; Patange, Girvan, and Larson 2018). Its applications range from characterising cell types in immunity (Lönnberg et al. 2017; Villani et al. 2017; Zheng et al. 2017) and development (Ibarra-Soria et al. 2018; Wagner et al. 2018; Pijuan-Sala et al. 2019) to dissecting the mechanism for cell fate commitment events (Goolam et al. 2016; Ohnishi et al. 2014). Within a population of cells, transcriptional heterogeneity can be due a mixture of possible causes. On the broadest level, this heterogeneity can reflect the presence of cell subtypes with distinct expression profiles. In such cases, clustering strategies can be used to select individual cell types and cell states for individual analyses (Kiselev, Andrews, and Hemberg 2019). More subtle variation within a seemingly homogeneous cell population can be due to deterministic or stochastic events, the later being referred to as ‘noise’ (Elowitz et al. 2002).
Classically, extrinsic noise is defined as stochastic fluctuations in cellular components, which is induced by cells residing in different dynamic states (cell size, cell cycle, metabolism, intra- and inter-cellular signalling)(Zopf et al. 2013; Iwamoto, Shindo, and Takahashi 2016; Kiviet et al. 2014). Intrinsic noise on the other hand arises from stochastic effects on biochemical processes such as transcription and translation (Elowitz et al. 2002). It can be modulated by genetic and epigenetic modifications (such as mutations, histone modifications, CpG island length and nucleosome positioning)(Eberwine and Kim 2015; Faure, Schmiedel, and Lehner 2017; Morgan and Marioni 2018) and is usually measured on the single-gene level (Elowitz et al. 2002). However, the observed cell-to-cell variability in gene expression or protein abundance arise from a combination of intrinsic and extrinsic noise, and determnisitic contributions to cell state differences in cell populations. Therefore, scRNA-Seq only allows to capture the molecular pehnotypic variability within cell populations, rather than the individual noise components (Patange, Girvan, and Larson 2018; Ecker et al. 2017). Here, the molecular phenotype is defined via the absolute abundance of intracellular molecules, such as transcripts, proteins, metabolites, etc.
Moreover, technical noise inflates the observed cell-to-cell variability in gene expression (Brennecke et al. 2013). To account for high amounts of technical noise that affects scRNA-seq data, external RNA spike-in molecules (such as the ones introduced by the External RNA Controls Consortium, ERCCs (External RNA Controls Consortium 2005)) can be added to the experiment before sequencing. Fitting a regression trend between the variability and the mean abundance of the ERCC molecules allows the statistical detection of genes the show larger variability than the technical background (Brennecke et al. 2013). Genes that show larger variability compared to spike-in molecules or the average variability are often referred to as ‘highly variable genes’ (HVG) and are used in computational scRNA-Seq analysis to select biologically informative genes for down-stream analysis (Lun, McCarthy, and Marioni 2016). Furthermore, spike-in molecules can be used to normalize gene expression for cells with differences in total mRNA content.
To address these technical issues, the BASiCS Bioconductor package estimates gene- and cell-specific model parameters that can be used to, for example, normalize raw scRNA-Seq counts and to detect highly and lowly variable genes (Catalina A. Vallejos, Marioni, and Richardson 2015a). BASiCS is a hierarchical Bayesian framework that propagates statistical uncerstainty into down-stream analysis and allows the decomposition of total variance into a biological and technical part by incorporating spike-in reads (Catalina A. Vallejos, Marioni, and Richardson 2015b). Furthermore, BASiCS has been extended to perform differential mean expression and differential variability testing (Vallejos, Richardson, and Marioni 2016).
Since the era of RNA sequencing, methods for differential expression testing of transcript counts across two conditions have been developed (Anders and Huber 2010; Robinson, McCarthy, and Smyth 2009). Due to high technical variability and sparsity in scRNA-seq data, new approaches were developed for differential expression testing specifically for scRNA-seq data (Katayama et al. 2013; Kharchenko, Silberstein, and Scadden 2014; Delmans and Hemberg 2016). In contrast to bulk samples, scRNA-seq measures variations in gene expression across a population of cells and can therefore be used to test for changes in expression variability between two conditions. For this, BASiCS compares the gene-specific over-dispersion parameters between two conditions. These parameters are independent of technical noise and can be used as proxy for biolgocial variability (Vallejos, Richardson, and Marioni 2016). Similar to the mean-variability trend observed for normalized scRNA-Seq data (Brennecke et al. 2013), the estimates for over-dispersion parameters decrease with mean expression (Vallejos, Richardson, and Marioni 2016). To correct for this, BASiCS has been extended to model the mean-variability relationship and capture residual over-dispersion estimates that show no association to mean expression. Therefore, this extension allows to test changes in mean expression in parallel to changes in variability (Eling et al. 2018).
Here we present a computational workflow for differential expression and differential variability testing of scRNA-seq data using the previously published BASiCS package (Catalina A. Vallejos, Marioni, and Richardson 2015b; Vallejos, Richardson, and Marioni 2016; Eling et al. 2017) implemented in the statistical computing language R. We highlight the use of the scater and scran Bioconductor packages to perform initial quality control and low-level analysis (McCarthy et al. 2017; Lun, McCarthy, and Marioni 2016), which is comparable to analyses performed using BASiCS. Two case studies exemplify the use of BASiCS for non-UMI and UMI scRNA-Seq data. In the first case, BASiCS can be used to detect highly and lowly variable genes and to obtain robust, gene-specific estimates to assess biological variability in naive CD4+ T cells (Martinez-Jimenez et al. 2017); for a similar workflow see (Kim, Lee, and Kim 2019). Furthermore, we compare naive to activated CD4+ T cells to highlight the use of BASiCS to test for changes in mean expression and expression variability. In the second case, we use droplet-based scRNA-Seq data to detect more subtle transcriptional changes during embryonic somitogenesis (Ibarra-Soria et al. 2018)
Methods: We describe the main Bioconductor packages used in this worflow and specifically highlight different settings for the MCMC sampler implemented in the BASiCS Bioconductor package.
C1 Fluidigm data: We describe an end-to-end workflow to obtain, process, quality filter raw expression counts of CD4+ T cells and how to perform analysis using BASiCS. This section is sub-divided into a single group case, where BASiCS is used to normalize data and to detect highly and lowly variable genes, and a two group case, for which we perform differential mean expression and differential variability testing.
10X Genomics data: We highlight the use of BASiCS for scRNA-Seq data that do not contain technical spike-in transcript. In this case, technical noise is estimated via experimental replication. Similar to the CD4+ T cells case, we perform differential mean expression and differential variability testing to detect early cell fate decision events in mammalian somitogenesis.
Figure 1: Computational workflow highlighting the use of BASiCS to analyse single groups or two groups of cells
HVG: highly variable genes; LVG: lowly variable genes.
In this workflow, we use a number of Bioconductor and CRAN packages for different parts of the analyses. Here, we will briefly highlight the main functionalities of these packages from a users perspective.
The SingleCellExperiment Bioconductor package offers a S4 class container to store scRNA-Seq count data and their associated gene- and cell-specific metadata. Throughout this workflow, we perform analyses using a SingleCellExperiment object that contains (at least) the raw expression counts, gene identifiers, batch information and cell identifiers. The SingleCellExperiment function uses a SummarizedExperiment and takes a the raw expression counts in form of a list, and the cell- and gene-specific metadata in form of data frames.
The scater Bioconductor package has been developed to facilitate the handling of scRNA-Seq data (McCarthy et al. 2017). Its main functions include the calculation of quality control (QC) metrics, the visualization of such metrics and expression counts, and the normalization of expression counts. We will primarily use the calculateQCMetrics function to calculate cell- and gene-specific QC metrics, the plotPCA function to visualize QC metrics, metadata and gene expression, and the normalize function to pre-normalize expression data.
The scran Bioconductor package offers a variety of functions for low-level scRNA-Seq data analysis (Lun, McCarthy, and Marioni 2016). While it contains function for batch correction, doublet detection, estimation of cell cycle states, we will use it primarily for pre-normalization in junction with the scater package (Lun, Bach, and Marioni 2016) and for modelling the mean-variance trend across all genes. Using the trendVar and decomposeVar functions will be used to fit a mean-dependent trend to the gene-specific variances before decomposing the overall variance into technical and biological components. Furthermore, we will use the DM function to calculate the distance of gene-specific squared coefficients of variation (CV2) to a rolling median along mean expression (Kolodziejczyk et al. 2015).
The BASiCS Bioconductor package contains an assembly of functions to estimate and analyse gene- and cell-specific model parameters (Catalina A. Vallejos, Marioni, and Richardson 2015a; Vallejos, Richardson, and Marioni 2016; Eling et al. 2018). BASiCS is build upon a hierarchical Baysian framework and as such samples posterior distribution for each model parameter. In mathematical terms, the gene expression count \(X_{ij}\) for gene \(i\) in cell \(j\) is modelled as:
\[ \begin{aligned} X_{ij}|\mu_i,\nu_j&\sim{}\text{Poisson}(\nu_j\mu_i)\\ \nu_j|s_j,\theta&\sim{}\text{Gamma}(1/\theta,1/(s_j\theta))\\ \rho_{ij}|\delta_i&\sim{}\text{Gamma}(1/\delta_i,1/\delta_i) \end{aligned} \]
where \(\mu_i\) explains the gene’s mean expression, \(\nu_j\) the technical effect dominated by the mRNA capture efficiency \(s_j\) and the unexplained technical noise parameter \(\theta\). Here, \(\phi_j\) is the cell-specifc size factor and \(\rho_{ij}\) the biological random effect incoroporating the over-dispersion hyper-parameter \(\delta_i\). It is important to note that, in the further analysis, \(\mu_i\) represents the mean expression and \(\delta_i\) the biological over-dispersion of each gene.
Due to a strong association between the mean expression and the biological over-dispersion, the model has been extended to correct for such confounding effect. To this end, the prior distribution for the over-dispersion parameters has been changed to:
\[ \delta_i | \mu_i \sim \text{log-t}_{\eta}\left( \text{f}(\mu_i), \sigma^2 \right). \]
where \(\text{f}(\mu_i)\) describes a smooth regression trend between the over-dispersion and the mean expression parameters. This extension introduced the residual over-dispersion parameters \(\epsilon_i=\delta_i-\text{f}(\mu_i)\) that do not scale with mean expression (Eling et al. 2018).
From a data analysis perspective, the BASiCS_MCMC function is the heart of the BASiCS package and can be run in four different settings (see Table 1).
| No regression | Regression | |
|---|---|---|
| Using spike-in reads | WithSpikes = TRUE |
WithSpikes = TRUE |
Regression = FALSE |
Regression = TRUE |
|
| No spike-ins available | WithSpikes = FALSE |
WithSpikes = FALSE |
Regression = FALSE |
Regression = TRUE |
If spike-in counts are availabe and should be used to estimate technical noise, the parameter WithSpikes is set to TRUE (default). If the regression between mean expression and over-dispersion should be performed, the Regression parameters is set to TRUE (default). If the user decides to set Regression = FALSE, BASiCS will not estimate the regression trend and does not supply the residual over-dispersion parameters \(\epsilon_i\).
The BASiCS_MCMC function returns a BASiCS_Chain object, which can be used for further doenstream analysis as indicated in this workflow.
The coda CRAN package contains a variety of functions to assess the convergence of a sampled MCMC chain. To use coda functions, the individual chains returned by BASiCS need to be transformed into a MCMC object that coda recognises using the coda::mcmc function. BASiCS offers a number of functions to visualize the convergence of MCMC chains. Nevertheless, we will use the effectiveSize function from coda to compute the effective sample size of individual MCMC chains.
The goseq Bioconductor package offers functions to detect the enrichment of gene ontologies (GOs) among user-specified gene sets (Young et al. 2010). Furthermore, goseq corrects for gene length biases, which is useful for full length scRNA-Seq data as highlighted in the first section. In this workflow, we will use goseq to detect GO enrichment among differentially expressed sets of genes.
For downstream analysis, such as GO enrichment analysis or the biological interpretation of individual genes, we need to (i) link each gene’s ID to its symbol and (ii) calculate each gene’s length. For the first task, the biomaRt Bioconductor package annotates a wide range of gene and gene product identifiers (Durinck et al. 2005) by accessing the BioMart software suite (http://www.biomart.org). We can use biomaRt to link the Mus musculus gene IDs and to their gene symbols (also referred to as ‘gene name’):
library("ggplot2")
library("biomaRt")
library("EnsDb.Mmusculus.v79")
library("SingleCellExperiment")
library("scran")
library("scater")
library("BASiCS")
library("coda")
library("goseq")
library("org.Mm.eg.db")
library("data.table")
library("pheatmap")
library("ggpointdensity")
## Set theme for ggplot2 plots
theme_set(theme_bw())
## Directories to store downloaded files and intermediate results
dir.create("rds", showWarnings = FALSE)
dir.create("downloads", showWarnings = FALSE)
if (!file.exists("rds/genenames.rds")) {
# Initialize mart
ensembl <- useMart("ensembl")
# Select dataset
ensembl <- useDataset(
dataset = "mmusculus_gene_ensembl",
mart = ensembl
)
# Select gene ID and gene name
genenames <- getBM(
attributes = c("ensembl_gene_id", "external_gene_name"),
mart = ensembl
)
rownames(genenames) <- genenames$ensembl_gene_id
saveRDS(genenames, "rds/genenames.rds")
} else {
genenames <- readRDS("rds/genenames.rds")
}
For the second task, and to perform GO enrichment analysis, we further collected the gene length information of each gene. This can be extracted from the EnsDb.Mmusculus.v79 annotation package. This resource offers gene annotations, such as exon positions and promoters, for the Ensembl data base.
if (!file.exists("rds/genelength.rds")) {
# Build exon list
exons_list <- exonsBy(EnsDb.Mmusculus.v79, by = "gene")
# Calculate summed length of all exons
genelength <- unlist(lapply(
exons_list,
function(x) {
sum(width(reduce(x)))
}
))
saveRDS(genelength, "rds/genelength.rds")
}
genelength <- readRDS("rds/genelength.rds")
# Add gene length to gene names
genenames$gene_length <- genelength[genenames$ensembl_gene_id]
The following R, Bioconductor and package version were used for this workflow:
R version: R version 3.6.1 (2019-07-05)
Bioconductor version: 3.10
Packages: BASiCS 1.8.0, scran 1.14.3, scater 1.14.3, coda 0.19.3, goseq 1.38.0
For the full list of packages used, please see the Session Info.
For the first case study, we will use scRNA-Seq data of CD4+ T cells, which were processed using the C1 Single-Cell Auto Prep System (Fluidigm®) using 10–17 \(\mu{}m\) integrated fluidic circuits (IFCs). Martinez-Jimenez et al. profiled naive and activated CD4+ T cells from young and old animals across two mouse strains to test for changes in expression variability that occur during organismal ageing (Martinez-Jimenez et al. 2017). Naive or effector memory CD4+ T cells were extracted from spleens of young or old animals and filtered using either magnetic-activated cell sorting (MACS) or fluorescence activated cell sorting (FACS) (labelled as MACS-purified Naive, FACS-purified Naive or FACS-purified Effector Memory). For clarification, naive CD4+ T cells are also referred to as ‘unstimulated’ CD4+ T cells.
In addition to profiling naive CD4+ T cells, the authors stimulated half of the naive cells for 3 hours using in vitro antibody stimulation (labelled as Active). Naive as well as activated CD4+ T cells were processed using the C1 Fluidigm® system to capture and lyse cells, and to reverse transcribe and amplify mRNA prior to sequencing. Cells were isolated from B6 (C57BL/6J, Mus musculus domesticus) and CAST (Mus musculus castaneus) animals to profile the evolutionary conservation of transcriptional dynamics during ageing. External spike-in RNA has been added to quantify technical variability across all cells. All experiments were performed in replicates (also referred to as batches) to control for batch effects.
We will begin the workflow with obtaining the data before quality control, running the BASiCS model, and performing further downstream analysis.
The raw counts of the full dataset can be obtain from ArrayExpress under the accession number E-MTAB-4888. In this dataset, the column names contain the library identifier of the original experiment while the row names of the matrix store the individual gene names. The dataset contains reads mapping to ERCC spike-in genes (External RNA Controls Consortium 2005), which BASiCS uses to estimate and remove technical noise.
if (!file.exists("downloads/raw_data.txt")) {
# Download raw counts file
website <- "https://www.ebi.ac.uk/"
folder <- "arrayexpress/files/E-MTAB-4888/"
file <- "E-MTAB-4888.processed.1.zip"
destfile <- "downloads/raw_data.txt.zip"
download.file(
paste0(website, folder, file),
destfile = destfile
)
unzip("downloads/raw_data.txt.zip", exdir = "downloads")
file.remove("downloads/raw_data.txt.zip")
}
# Read in raw data
CD4_raw <- read.table("downloads/raw_data.txt", header = TRUE, sep = "\t")
CD4_raw <- as.matrix(CD4_raw)
# Show row and column names
head(colnames(CD4_raw))
## [1] "do4737" "do4739" "do4740" "do4744" "do4752" "do4755"
head(rownames(CD4_raw))
## [1] "ENSMUSG00000000001" "ENSMUSG00000000003" "ENSMUSG00000000028"
## [4] "ENSMUSG00000000031" "ENSMUSG00000000049" "ENSMUSG00000000056"
# ERCC spike-in genes
head(rownames(CD4_raw)[grepl("ERCC", rownames(CD4_raw))])
## [1] "ERCC-00002" "ERCC-00003" "ERCC-00004" "ERCC-00009" "ERCC-00012"
## [6] "ERCC-00013"
To link the library identifiers to the experimental conditions, the authors provide a metadata file that can be accessed online under the same accession number.
download_file <- function(file, website, folder, destfile = file) {
if (!file.exists(destfile)) {
download.file(paste0(website, folder, file), destfile)
}
}
if (!file.exists("downloads/metadata_file.txt")) {
# Download raw counts file
website <- "https://www.ebi.ac.uk/"
folder <- "arrayexpress/files/E-MTAB-4888/"
file <- "E-MTAB-4888.additional.1.zip"
destfile <- "downloads/metadata.txt.zip"
download.file(
paste0(website, folder, file),
destfile = destfile
)
unzip("downloads/metadata.txt.zip", exdir = "downloads")
file.remove("downloads/metadata.txt.zip")
}
# Read in metadata file
CD4_metadata <- read.table("downloads/metadata_file.txt", header = TRUE, sep = "\t")
# Save library identifier as rownames
rownames(CD4_metadata) <- CD4_metadata$X
# Show metadata entries
head(CD4_metadata)
## X Strain Age Stimulus Individuals
## do4737 do4737 Mus musculus castaneus Young Unstimulated CAST young 1
## do4739 do4739 Mus musculus castaneus Young Unstimulated CAST young 1
## do4740 do4740 Mus musculus castaneus Young Unstimulated CAST young 1
## do4744 do4744 Mus musculus castaneus Young Unstimulated CAST young 1
## do4752 do4752 Mus musculus castaneus Young Unstimulated CAST young 1
## do4755 do4755 Mus musculus castaneus Young Unstimulated CAST young 1
## Celltype
## do4737 MACS-purified Naive
## do4739 MACS-purified Naive
## do4740 MACS-purified Naive
## do4744 MACS-purified Naive
## do4752 MACS-purified Naive
## do4755 MACS-purified Naive
This metadata file contains the library identifiers (X), the strain information (Strain), the relative age of the animals (Age), the stimulation state of the cells (Stimulus), the batch information indicating the different mice used (Individuals), and the purification method used (Celltype).
TODONE: update with altExp
In the next step, we will first generate a SingleCellExperiment (SCE) object that contains all cells and that stores cell- and gene-specific metadata. This data class offers convenient ways to subset, and to set and retrieve cell- and gene-specific information.
We will also store information on the spike-in molecules using the altExp accessor. This allows us to store alternative experimental assays other than endogenous genes — in this case, technical (spike-in) genes.
# Load library
bio_counts <- CD4_raw[!grepl("ERCC", rownames(CD4_raw)), ]
spike_counts <- CD4_raw[grepl("ERCC", rownames(CD4_raw)), ]
# Generate the SingleCellExperiment object
sce_CD4_all <- SingleCellExperiment(
assays = list(counts = as.matrix(bio_counts)),
colData = CD4_metadata[colnames(CD4_raw), ]
)
altExp(sce_CD4_all, "spike-ins") <- SummarizedExperiment(
assays = list(counts = spike_counts)
)
For further downstream analysis, we select naive and activated CD4+ T cells from young B6 animals that were obtained using the MACS-based cell selection. It is crucial that proper quality control (QC) and gene filtering is performed before running BASiCS, and, in general, before performing meaningful computational analysis on biological data.
ind_select <- sce_CD4_all$Strain == "Mus musculus domesticus" &
sce_CD4_all$Age == "Young" &
sce_CD4_all$Celltype == "MACS-purified Naive"
sce_naive_active <- sce_CD4_all[, ind_select]
Prior to cell-specific quality control, we will remove all genes that are not detected in at least 2 cells across both conditions. Furthermore, we remove genes that are not expressed with an average count of 1 across all cells. These thresholds need to be set dataset-specifically and careful gene-specific quality metrics need to be closely examined as suggested by the SimpleSingleCell Bioconductor workflow (Lun, McCarthy, and Marioni 2016).
# Remove genese)
ind_expressed <- rowSums(counts(sce_naive_active) > 0) > 1 &
rowMeans(counts(sce_naive_active)) >= 1
sce_naive_active <- sce_naive_active[ind_expressed, ]
In the next step, we will use the scran and scater Bioconductor packages for initial normalization and visualization of quality metrics (Lun, McCarthy, and Marioni 2016,McCarthy et al. (2017)). The normalization at this point is needed to avoid biases in the visualization due to differences in the mRNA content between cells. In line with the normalization strategy of BASiCS, we use the spike-in reads for normalization.
# Pre-normalization of data for visualization purposes
sce_naive_active <- computeSpikeFactors(sce_naive_active, spikes = "spike-ins")
sce_naive_active <- logNormCounts(sce_naive_active)
The Biocpkg("SimpleSingleCell") Bioconductor workflow provides an extensive overview on important aspects of how to perform low-level analysis of scRNA-Seq data, including quality control (Lun, McCarthy, and Marioni 2016). The calculateQCMetrics function of the scater package can be used to calculate a number of cell- and gene-specific quality metrics. Furthermore, scater offers the runPCA function to compute a principal component analysis (PCA) across all cells.
# Calculate quality metrics
sce_naive_active <- addPerCellQC(sce_naive_active, use_altexps = TRUE)
sce_naive_active <- addPerFeatureQC(sce_naive_active)
clean_colnames <- make.names(colnames(colData(sce_naive_active)))
colnames(colData(sce_naive_active)) <- clean_colnames
# Calculate PCA
sce_naive_active <- scater::runPCA(sce_naive_active)
For the CD4+ T cell dataset, the authors removed empty capture sites and sites which contained multiple cells or debris as observed by visual inspection. Furthermore, they removed cells which had fewer than 1,000,000 total reads, cells where less than 20% of reads mapped to endogenous genes, cells in which less than 1,250 or more than 3,000 genes were detected and cells with more than 10% or less than 0.5% reads mapping to mitochondrial genes (Martinez-Jimenez et al. 2017).
Since these data have been quality filtered in the original publication, we will only visualize the distribution of quality metrics across the cells. The highlighted quality metrics can otherwise be used to identify low quality cells. These metrics include (among others): number of endogeneous genes detected per cell, total number of reads per cell, the fraction of reads mapping to spike-in and endogeneous genes. The main aim is to identify broken or dying cells, and possibly empty wells. For an extenive discussion on quality metrics for scRNA-Seq data, see Ilicic et al. (Ilicic et al. 2016).
The activation status and batch information for each individual cell of the selected dataset can be seen in Figure 2.
# Visualize the conditions and batch structure
p_stimulus <- scater::plotPCA(sce_naive_active, colour_by = "Stimulus")
p_batch <- scater::plotPCA(sce_naive_active, colour_by = "Individuals")
multiplot(p_stimulus, p_batch, cols = 2)
Figure 2: Visualization of the cell stimulus (left) and the batch information per cell (right)
Cells from two animals were captured either in the naive or activated state.
The data splits into two conditions: naive and activated CD4+ T cells. Furthermore, we do not detect batch effects or outlier cells based on PCA visualization.
In the next step, we will visualize selected cell-specific quality metrics overlayed on the PCA. Figure @ref(fig:no-genes_total-counts) highlights the number of endogeneous genes detected per cell and the total number of counts. Both metrics can be used to detect empty wells or broken cells (low values) and potentially doublets (high values).
# Visualize number of endogeneous genes detected
p_total_features <- scater::plotPCA(
sce_naive_active,
colour_by = "detected"
) +
scale_fill_viridis_c(name = "Number of genes", trans = "log10")
# Visualize log10-transformed total number of counts
p_total_counts <- scater::plotPCA(
sce_naive_active,
colour_by = "sum"
) +
scale_fill_viridis_c(name = "Number of counts", trans = "log10")
multiplot(p_total_features, p_total_counts, cols = 2)
Figure 3: Visualization of the number of biological genes detected per cell (left) and the total number of reads per cell (right)
We detect a higher number of expressed genes and a higher total read count in activated cells. Furthermore, cells within each group show a wide distribution of these quality measures without clear outlying cells.
A widely used quality visualisation is to plot the total number (or fraction) of spike-in counts versus the total number (or fraction) of endogeneous counts (Figure 4). In such a plot, low quality wells are characterised by a high fraction of spike-in counts and a low fraction of endogeneous counts.
# Plot the fraction of reads mapping to spike-ins
# versus the number of reads mapping to endogeneous genes
ggplot(as.data.frame(colData(sce_naive_active))) +
geom_point(aes(sum, altexps_spike.ins_sum)) +
geom_vline(xintercept = 10^5.8, linetype = "dashed", colour = "grey60") +
scale_x_log10() +
scale_y_log10()
Figure 4: The number of spike-in counts per cells is plotted against the number of endogeneous counts per cell
A dashed line indicates the minimum level of spike-in counts, below which cells are removed.
Since we are lacking the information of the total reads per cell (including the non-mapped and intronic reads), we can solely filter cells based on the total number of endogeneous counts (assuming equal numbers of spike-in molecules per well). Nevertheless, we remove two cells that showed unusually low number of reads mapping to endogeneous genes. We indicate the threshold used for filtering on the plot using a dashed line.
# Remove two cells that appear to be outliers
ind_retain <- log10(sce_naive_active$sum) > 5.8
sce_naive_active <- sce_naive_active[, ind_retain]
It is crucial that an in-depth quality control is performed to remove low-quality or outlying cells. Furthermore, such a quality control step can help understand where possible confounding effects arise that can influence the interpretation the results of downstream analyses.
BASiCS requires the absolute molecule count of the spike-in transcripts that were added to each well. To calculate the molecule count, we require the dilution and the concentration of the spike-in mix. For this, a table of the spike-in concentrations can be downloaded from https://www.thermofisher.com. The file contains the concentrations of 2 ERCC spike-in mixes.
For the CD4+ T cell data, the authors added a 1:50,000 dilution of the ERCC spike-in mix 1 (Martinez-Jimenez et al. 2017). We will first download the concentration information.
# Read in the spike-in concentrations
website <- "https://assets.thermofisher.com/"
folder <- "TFS-Assets/LSG/manuals/"
file <- "cms_095046.txt"
download_file(file, website, folder, "downloads/spike_info.txt")
ERCC_conc <- read.table(
"downloads/spike_info.txt",
sep = "\t", header = TRUE
)
The concentration is given in units of attomoles per \(\mu{}l\). We will first calculate the concentration in moles per \(\mu{}l\).
# Moles per micro litre
ERCC_mmul <- ERCC_conc$concentration.in.Mix.1..attomoles.ul. * (10^(-18))
From this, we can calculate the molecule count per \(\mu{}l\) using the fact that 1 mole is equal to \(6.02214076\cdot10^{23}\) molecules.
# Molecule count per micro litre
ERCC_countmul <- ERCC_mmul * (6.02214076 * (10^23))
During the preparation of the reaction mix, the authors diluted this mix by a factor of 1:50,000 (Martinez-Jimenez et al. 2017). The actual molecule number of spike-ins can therefore be calculated per \(\mu{}l\):
ERCC_count <- ERCC_countmul / 50000
The volume per well in the IFC chip is 9 nano litre https://www.fluidigm.com/faq/ifc-9. We therfore need to calculate the number of molecules in 9 nano litre reaction mix.
ERCC_count_final <- ERCC_count * 0.009
Depending on the technology used to capture and process single cells, and the dilution of the spike-in mix, the absolute number of spike-ins per reaction can change. The absolute amount of spike-ins per reaction is purely a scale … BASiCS assumes that the quantity of spike-ins is consistent in each well in each experiment. While the quantity used must remain constant between wells and experiments, the scale does not affect the results, provided the spike-ins are at a reasonable level. In particular, they should not be in such a low concentration that they are rarely detected, and they should not be at such a high concentration that the majority of the sequencing reads map to the spike-in molecules.
TODONE: Add explanation that the number of spike-ins need to be consistent across all wells and experiments but that the scale does not effect the results.
We can now use the molecule count to prepare the BASiCS data object. To incorporate the spike-in molecule counts within the SingleCellExperiment object that BASiCS requires, the first column must contain the names associated to the spike-in genes. The second column must contain the input number of molecules for the spike-in genes (amount per reaction).
# Prepare the data.frame
ERCC_count <- data.frame(
row.names = ERCC_conc$ERCC.ID,
Names = ERCC_conc$ERCC.ID,
count = ERCC_count_final
)
To highlight the use of BASiCS to analyse cells in the single-condition case, we select naive CD4+ T cells of young B6 animals. Here, BASiCS can be used to identify highly- and lowly-variable genes and calculate interpretable gene- and cell-specific parameters. Among others, these include gene-specific parameters for mean expression, over-dispersion (biological variability) and residual over-dispersion (biological variability after correcting for the mean expression effect).
The BASiCS Data object is an object of the class SingleCellExperiment. The newBASiCS_Data function can be used to create the required SingleCellExperiment object based on the following information:
Counts: a matrix of raw expression counts with dimensions \(q\) times \(n\), where \(q\) is the number of genes (technical + biological) and \(n\) is the number of cells. Gene names must be stored as row names of the counts matrix.
Tech: a vector of TRUE/FALSE elements with length \(q\). If Tech[i] = FALSE the gene i is biological; otherwise the gene is spike-in. This vector must be specified in the same order of genes as in the counts matrix.
SpikeInfo: a data.frame with \(q-q_0\) rows where \(q_0\) is the number of biological genes. The first column must contain the names associated to the spike-in genes. The second column must contain the input number of molecules for the spike-in genes (amount per cell).
BatchInfo (optional argument): vector of length \(n\) to indicate batch structure in situations where cells have been processed using multiple batches.
We will first select the naive cells from the SingleCellExperiment object that was generated above.
ind_stimulated <- sce_naive_active$Stimulus == "Unstimulated"
sce_naive <- sce_naive_active[, ind_stimulated]
ind_expressed <- rowSums(counts(sce_naive) > 0) > 1 &
rowMeans(counts(sce_naive)) >= 1
sce_naive <- sce_naive[ind_expressed, ]
Next, we will use the newBASiCS_Data function to re-generate the SingleCellExperiment object for the use with BASiCS.
# Here is the first time that we use BASiCS
# Select the ERCC spike-ins of the dataset
counts <- counts(sce_naive)
spikes <- assay(altExp(sce_naive, "spike-ins"))
spikes_present <- rowSums(spikes) != 0
## Remove spike-ins that are not present from matrix and SCE object
spikes <- spikes[spikes_present, ]
altexp_present <- altExp(sce_naive, "spike-ins")[spikes_present, ]
altExp(sce_naive, "spike-ins") <- altexp_present
ind_spike <- c(rep(FALSE, nrow(counts)), rep(TRUE, nrow(spikes)))
spike_input <- ERCC_count[rownames(spikes), ]
# Generate the SingleCellExperiment object
data_naive <- newBASiCS_Data(
Counts = rbind(counts, spikes),
Tech = ind_spike,
SpikeInfo = spike_input,
BatchInfo = sce_naive$Individuals
)
data_naive
## class: SingleCellExperiment
## dim: 9081 91
## metadata(1): SpikeInput
## assays(1): counts
## rownames(9081): ENSMUSG00000000001 ENSMUSG00000000056 ...
## ENSMUSG00000106643 ENSMUSG00000106648
## rowData names(0):
## colnames(91): do6113 do6118 ... do6398 do6399
## colData names(1): BatchInfo
## reducedDimNames(0):
## spikeNames(0):
## altExpNames(1): spike-ins
Alternatively, when using datasets that contain spike-in genes, the original SingleCellExperiment object can be extended by simply specifying a BatchInfo slot in the colData object and by adding the SpikeInfo object to the metadata slot.
data_naive <- sce_naive
colData(data_naive)$BatchInfo <- colData(sce_naive)$Individuals
metadata(data_naive)$SpikeInput <- spike_input
After creating the SingleCellExperiment object that contains all information that BASiCS requires, the MCMC sampler can be run to generate posterior estimates of model parameters.
It is recommended to run the BASiCS_MCMC sampler for at least 40,000 iterations to assure convergence. However, if datasets are large and each condition contains hundreds of cells from a homogeneous population, BASiCS can be run with fewer (e.g. 20,000) iterations (see 10X Genomics data). However, for convenience, BASiCS_MCMC can be run with very few (e.g 1,000) iterations to test whether the sampler breaks.
Here, we run the MCMC sampler for 40,000 iterations, using an initial 20,000 iterations as burn in an a thinning step of 20 iterations. Since the dataset contains spike-in genes, we run the sampler with WithSpikes = TRUE and we also want to correct for the mean-variance trend using Regression = TRUE. For further information see Methods.
MCMC_naive <- BASiCS_MCMC(
Data = data_naive,
PrintProgress = FALSE,
N = 40000,
Thin = 20,
Burn = 20000,
Regression = TRUE,
WithSpikes = TRUE
)
This sampler runs for 167 minutes on a 1.4 GHz Intel Core i5 procesor with 4GB RAM and produces a BASiCS_Chain data object. For comparison, this sampler runs for 97 minutes on a 3.4 GHz Intel Core i7 procesor with 16GB RAM. For convenience, the MCMC chain can be obtained online at https://git.ecdf.ed.ac.uk/vallejosgroup/basicsworkflow2020.
download_file(
file = "MCMC_naive.rds",
website = "https://git.ecdf.ed.ac.uk/vallejosgroup/",
folder = "basicsworkflow2020/raw/master/MCMCs/",
destfile = "rds/MCMC_naive.rds"
)
MCMC_naive <- readRDS("rds/MCMC_naive.rds")
The BASiCS_Chain object contains a list of matrices that store the individual MCMC draws per parameter. Each matrix contains the cell- or gene-specific parameters in the columns and the MCMC draws in the rows. BASiCS provides the displayChainBASiCS function to access the cell- or gene-specific parameters. As an example, we access the first 10 draws for \(\mu_i\) of the first 10 genes.
displayChainBASiCS(MCMC_naive, Param = "mu")[1:2, 1:10]
## ENSMUSG00000000001 ENSMUSG00000000056 ENSMUSG00000000085
## [1,] 10.268068 0.9030355 0.6446042
## [2,] 7.589816 1.5062325 0.5904772
## ENSMUSG00000000088 ENSMUSG00000000131 ENSMUSG00000000134
## [1,] 2.835671 29.79139 2.207230
## [2,] 2.589936 27.75821 1.512179
## ENSMUSG00000000142 ENSMUSG00000000148 ENSMUSG00000000149
## [1,] 1.285121 1.4283485 0.5782942
## [2,] 1.110826 0.7858234 1.2008519
## ENSMUSG00000000168
## [1,] 0.7813204
## [2,] 0.4178972
We ran the MCMC samples for 40,000 iterations, with 20,000 iterations and a thining value of 20. This results in the storage of 1000 draws for each parameter in the BASiCS_Chain object:
dim(displayChainBASiCS(MCMC_naive, Param = "mu"))
## [1] 1000 9081
In the next step, we need to validate the convergence of the chain to ensure robust downstream analysis. There are multiple ways to visualize and test the convergence of MCMC chains (see (Cowles and Carlin 1996), (Brooks and Gelman 1998)). The coda contains diagnostic and plot functions for this task ((Plummer et al. 2006)). Here, we highlight two ways of assessing the convergence of the MCMC sampler by (i) plotting trace plots, sample densities and the chain autocorrelation, and (ii) plotting the the effective sample size across multiple parameters. Trace plots show the sampled parameter values over time. A chain converged when the sample density (in form of a histogram) shows a unimodal distribution. The autocorrelation of an MCMC chain is defined as the Pearson correlation between the chain and a time-delayed version of the chain. The difference in time-points is referred to as ‘lag’. The chain converged if the autocorrelation (except for lag = 1) is small (e.g. < 0.25).
Effective sample size is a measure of the number of independent samples generated for a model parameter (Gelman et al. 2014). Simply, it is defined as the number of samples taken relative to the total autocorrelation. More formally, it is defined as follows:
\[ \mbox{ESS} = \frac{n}{1 + 2\sum_{k=1}^\infty \rho(k)} \] where \(n\) is the number of samples and \(\rho(k)\) is the autocorrelation at lag \(k\). We can visualise this parameter using two
# Convergence of mean expression parameters
# Trace plot, sample histogram and autocorrelation
plot(MCMC_naive, Param = "mu", Gene = 1)
plot(MCMC_naive, Param = "mu", Gene = 7216)
# Effective sample size
BASiCS_DiagPlot(MCMC_naive, Param = "mu")
# Histogram of sample size for mu, as some genes seem to have low values for mu.
BASiCS_DiagHist(MCMC_naive, Param = "mu")
# Convergence of over-dispersion parameters
# Trace plot, sample histogram and autocorrelation
plot(MCMC_naive, Param = "delta", Gene = 100)
plot(MCMC_naive, Param = "delta", Gene = 5000)
# Effective sample size
BASiCS_DiagPlot(MCMC_naive, Param = "delta")
# Convergence of residual over-dispersion parameters
# Trace plot, sample histogram and autocorrelation
plot(MCMC_naive, Param = "epsilon", Gene = 200)
plot(MCMC_naive, Param = "epsilon", Gene = 9000)
# Effective sample size
BASiCS_DiagPlot(MCMC_naive, Param = "epsilon")
# Convergence of mRNA capture efficiency parameters
# Trace plot, sample histogram and autocorrelation
plot(MCMC_naive, Param = "s", Cell = 10)
plot(MCMC_naive, Param = "s", Cell = 50)
# Effective sample size
BASiCS_DiagPlot(MCMC_naive, Param = "s")
We can see several genes show very low sample size for mu. We can identify these using functions from the CRANpkg(coda).
ess <- effectiveSize(mcmc(displayChainBASiCS(MCMC_naive, "mu")))
low_ess_genes <- names(ess[ess < 100])
hist(
ess,
breaks = "FD",
main = "Effective sample size",
xlab = "Effective sample size"
)
abline(v = 100, lty = "dashed", col = "grey60")
head(low_ess_genes)
## [1] "ENSMUSG00000000958" "ENSMUSG00000001524" "ENSMUSG00000001751"
## [4] "ENSMUSG00000002103" "ENSMUSG00000002280" "ENSMUSG00000002320"
Given that the sample has generated less than 100 effectively independent samples for these genes, there may not be sufficient posterior information to reliably calculate differences in mean expression, and it would be appropriate to exclude these genes from downstream analysis.
In this section, we will highlight the use of BASiCS when cells of a single condition are analysed. The downstream analysis includes: normalization, variance decomposition, detection of highly and lowly variable genes and the use of gene-specific parameters as interpretable variability measures. Furthermore, we will compare the results in the individual steps with results obtained using the Biocpkg("scran").
Posterior estimates of cell-specific parameters can be used to normalize the data and correct for biases in mRNA content (Vallejos et al. 2017). To perform normalization, BASiCS provides the BASiCS_DenoisedCounts and the BASiCS_DenoisedRates functions. TODO: Add more details on the DenoisedCounts function These functions take the SingleCellExperiment and BASiCS_Chain objects as inputs.
counts_denoised <- BASiCS_DenoisedCounts(Data = data_naive, Chain = MCMC_naive)
Alternatively, TODO: add the DenoisedRates
These normalized counts can be further used for dimensionality reduction and clustering as explained elsewhere TODO: ref,ref.
BASiCS offers a function to select genes with large and small biological variance. When the MCMC sampler was run with Regression = TRUE (default), the BASiCS_DetectHVG and BASiCS_DetectLVG functions take the BASiCS_Chain object and a quantile threshold. Using the residual over-disperison parameters, genes are ranked by their variability and the BASiCS_DetectHVG function selects the (for example) top 10% of highly variable genes (PercentileThreshold = 0.9). Similarly, when detecting lowly variable genes, the BASiCS_DetectLVG selects the top 10% of lowly variable genes (PercentileThreshold = 0.1). The propability threshold for a gene showing higher variability than the percentile threshold is found by controling the EFDR to 10% (default).
# Highly variable genes
HVG <- BASiCS_DetectHVG(MCMC_naive, PercentileThreshold = 0.9)
# Lowly variable genes
LVG <- BASiCS_DetectLVG(MCMC_naive, PercentileThreshold = 0.1)
This analysis results in the detection of 265 highly variable genes and 778 lowly variable genes.
The Biocpkg("scran") provides similar functions to detect HVGs and we can compare the results of both methods. scran first fits a smooth regression between the variance of the log-transformed expression of the spike-in transcripts and their mean abundance using the trendVar function. Afterwards, the decomposeVar function decomposes the gene-specific variance into a biological and technical component.
# Fit the mean-variance trend
means <- rowMeans(logcounts(sce_naive))
vars <- rowVars(logcounts(sce_naive))
var_fit <- scran::fitTrendVar(means, vars)
# Variance decomposition
var_out <- scran::modelGeneVar(sce_naive)
As proposed by the SimpleSingleCell workflow, the HVGs are described as displaying a biolgical variance component of larger than 0.5 while controlling the FDR to 5% (Lun, McCarthy, and Marioni 2016). TODO: Make sure this is still in the current version of SimpleSingleCell. We can then compare the the overlap of the HVG identified by scran and by BASiCS.
ind_hvg <- which(var_out$FDR <= 0.05 & var_out$bio >= 0.5)
hvg_out <- var_out[ind_hvg, ]
hvg_out <- hvg_out[order(hvg_out$bio, decreasing = TRUE), ]
nrow(hvg_out)
## [1] 10
# Intersection between BASiCS and scran HVG
length(intersect(rownames(hvg_out), HVG$Table$GeneName[HVG$Table$HVG]))
## [1] 4
# Compare BASiCS and scran results
HVG_res <- paste(
HVG$Table$HVG,
var_out[HVG$Table$GeneName, "FDR"] <= 0.05 &
var_out[HVG$Table$GeneName, "bio"] >= 0.5
)
ind_ff <- HVG_res == "FALSE FALSE"
ind_notff <- !which(ind_ff)
ggplot() +
geom_point(
data = data.frame(
Mu = HVG$Table$Mu[ind_ff],
Epsilon = HVG$Table$Epsilon[ind_ff],
HVG = HVG_res[ind_ff]
),
aes(log(Mu), Epsilon, colour = HVG)
) +
geom_point(
data = data.frame(
Mu = HVG$Table$Mu[ind_notff],
Epsilon = HVG$Table$Epsilon[ind_notff],
HVG = HVG_res[ind_notff]
),
aes(log(Mu), Epsilon, colour = HVG)
)
All TODO: XYZ HVG genes of scran are among the HVGs identified by BASiCS.
TODO: Move this up
Numerous studies have highlighted the relationship between gene-specific variability measures such as the squared coefficient of variation and mean abundance TODO: ref,ref,ref. BASiCS provides the BASiCS_ShowFit function that plots the gene-specific over-dispersion parameters (delta) versus mean expression parameters (mu).
BASiCS_ShowFit(MCMC_naive)
Here, we observe that the over-dispersion estimates are negatively associated with mean expression. However, by performing the regression between over-dispersion and mean expression, we can correct for this trend and obtain variability measures that show no correlation with mean expression (Eling et al. 2018). The purple points in the plot indicate genes that are not expressed in at least 2 cells. BASiCS automatically excludes these genes due to challenges in interpreting variability estimates and changes in variability for genes that are only expressed in one cell.
We can now compare the gene-specific variability measures (over-dispersion and residual over-dispersion) to previously used measures to quantify cell-to-cell expression variability (Brennecke et al. 2013,Ola 2015).
Widely used measures of expression variability include the variance (Shalek et al. 2014), the Fano factor (variance divided by mean expression) https://www.nature.com/articles/nbt.2642 https://www.nature.com/articles/nmeth.2930 https://link.springer.com/article/10.1007/s00216-008-2431-z TODO: refs and the coefficient of variation (CV, variance divided by squared mean expression) https://www.nature.com/articles/nbt.3102 https://www.nature.com/articles/nmeth.2645 TODO: refs. Here, we will highlight the mean-variance realtionship for each variability measures. For this analysis, we exclude genes that are not expressed in at least 2 cells. These genes can be identified in the displayChainBASiCS(MCMC_naive, Param = "epsilon") to contain NA.
# Exclude ERCCs
counts_denoised <- counts_denoised[!grepl("ERCC", rownames(counts_denoised)), ]
# Exclude genes
eps_not_na <- !is.na(displayChainBASiCS(MCMC_naive, Param = "epsilon")[1, rownames(counts_denoised)])
counts_denoised <- counts_denoised[eps_not_na, ]
# Variance
var_genes <- apply(counts_denoised, 1, var)
mean_genes <- apply(counts_denoised, 1, mean)
ggplot(
data.frame(
variance = var_genes,
mean = mean_genes
)
) +
geom_point(aes(log(mean), log(variance)))
# Fano factor
ggplot(
data.frame(
fano = var_genes / mean_genes,
mean = mean_genes
)
) +
geom_point(aes(log(mean), log(fano)))
# Squared coefficient of variation
ggplot(
data.frame(
CV2 = var_genes / mean_genes^2,
mean = mean_genes
)
) +
geom_point(aes(log(mean), log(CV2)))
We see that all measures correlate with mean expression. The same is true for the over-dispersion parameters as estimated by BASiCS as shown below. Again, for this comparison, we exclude genes that do not show expression in at least 2 cells
summary_naive <- Summary(MCMC_naive)
# Remove genes
eps_not_na <- !is.na(displaySummaryBASiCS(summary_naive, Param = "epsilon")[, "median"])
mu_naive <- displaySummaryBASiCS(summary_naive, Param = "mu")[eps_not_na, ]
delta_naive <- displaySummaryBASiCS(summary_naive, Param = "delta")[eps_not_na, ]
# Over-dispersion versus mean expression
ggplot(
data.frame(
mu = mu_naive[, "median"],
delta = delta_naive[, "median"]
)
) +
geom_point(aes(log(mu), log(delta)))
# Compare delta to CV2
ggplot(
data.frame(
CV2 = var_genes / mean_genes^2,
delta = delta_naive[rownames(counts_denoised), "median"]
)
) +
geom_point(aes(log(CV2), log(delta)))
The over-dispersion parameters estimated using BASiCS show strong correlation with the CV2. Recently, we extended BASiCS to avoid the mean-variability relationship by performing an internal regression between the over-dispersion and mean expression parameters (as visualized in figure Figure showFit). Similarly, Kolodziejczyk et al. used the distance to a rolling median (DM) along the mean-variability trend to correct for this confounding factor (Kolodziejczyk et al. 2015). Here, we highlight how to obtain the residual variability estimates using BASiCS and scran. TODO: Add density to these plots
# Residual over-dispersion estimates
epsilon_naive <- displaySummaryBASiCS(summary_naive, Param = "epsilon")[eps_not_na, ]
# Residual over-dispersion versus mean expression
ggplot(
data.frame(
mu = mu_naive[, "median"],
epsilon = epsilon_naive[, "median"]
)
) +
geom_pointdensity(aes(log(mu), epsilon))
# DM values
DM.naive <- scran::DM(mean = mean_genes, cv2 = var_genes / mean_genes^2)
# DM versus mean expression
ggplot(
data.frame(
mean = mean_genes,
DM = DM.naive
)
) +
geom_pointdensity(aes(log(mean), DM))
# Compare residual over-dispersion and DM
ggplot(
data.frame(
epsilon = epsilon_naive[rownames(counts_denoised), "median"],
DM = DM.naive
)
) +
geom_pointdensity(aes(epsilon, DM))
Neither the DM nor the residual over-dispersion estimates show association with mean expression. Furthermore, the mean-independent variability measures display high correlation. These measures can be used to associate genomic features (Morgan and Marioni 2018,Faure, Schmiedel, and Lehner (2017)) or transcriptional dynamics (Antolović et al. 2017) to gene expression variability. While the DM is calculated as a point estimate, BASiCS stores each posterior sample within the BASiCS_Chain object. They can be accessed using the displayChain function, which displays cell- or gene-specific samples in form of a matrix where each column contains cell- or gene-specific paramters and rows contain the MCMC samples.
displayChainBASiCS(MCMC_naive, Param = "epsilon")[1:10, 1:10]
## ENSMUSG00000000001 ENSMUSG00000000056 ENSMUSG00000000085
## [1,] 0.012143973 -0.1074785 0.19959531
## [2,] 0.015216718 -0.1704527 0.19384074
## [3,] -0.035302394 0.1348456 0.20584254
## [4,] 0.008019817 -0.1212118 0.24030462
## [5,] -0.185304787 0.3296048 0.02219366
## [6,] 0.194610900 -0.1084879 0.26642177
## [7,] -0.234065310 -0.4757059 0.15923177
## [8,] -0.281569011 -0.2063406 0.18243192
## [9,] -0.037017589 -0.5195024 0.20142782
## [10,] 0.208842610 0.6230923 -0.12105013
## ENSMUSG00000000088 ENSMUSG00000000131 ENSMUSG00000000134
## [1,] -0.7213403 0.11407893 -0.04950098
## [2,] -0.8908909 -0.01500526 0.25342026
## [3,] -0.3769258 -0.05334287 0.50721027
## [4,] -0.7141363 -0.09503815 0.05465544
## [5,] -0.8610063 -0.05786627 -0.01770596
## [6,] -0.4166708 0.19077485 -0.48944947
## [7,] -0.6854039 0.25076432 -0.32808203
## [8,] -0.6895433 0.07555601 0.08461211
## [9,] -0.7400513 0.02100549 -0.19116327
## [10,] -0.6213723 -0.02283524 0.06427608
## ENSMUSG00000000142 ENSMUSG00000000148 ENSMUSG00000000149
## [1,] -0.25296360 0.7634435 0.19063083
## [2,] 0.38063198 0.1620200 -0.55605683
## [3,] -0.34126870 0.3717319 0.19398426
## [4,] -0.11014405 -0.2204141 0.32226516
## [5,] -0.19223586 0.5475225 0.07721988
## [6,] 0.02059824 -0.2837429 -0.20374831
## [7,] -0.60487739 0.1258938 0.05140366
## [8,] 0.15072027 -0.3777553 -0.14021451
## [9,] -0.22375198 -0.1868961 0.58213775
## [10,] -0.17681535 0.1484235 0.05341227
## ENSMUSG00000000168
## [1,] 0.01351666
## [2,] -0.24607807
## [3,] -0.10369271
## [4,] -0.33172934
## [5,] -0.52632335
## [6,] -0.51124415
## [7,] -0.31702236
## [8,] 0.18183497
## [9,] 0.12331260
## [10,] -0.47013036
TODO: Not sure if we need this By testing a certain association (for example between CpG island length and variability (Morgan and Marioni 2018)) for each MCMC sample, one can generate a post-hoc posterior distribution of the test statistic.
The workflow so far highlights the use of BASiCS for analysing cells of a single condtion.
This section highlights the use of BASiCS to perform differential testing (mean and variability) between cells of two condtions. For convenience, we will compare the naive CD4+ T cells, which were analysed in the previous section to activated CD4+ T cells of the same dataset (Martinez-Jimenez et al. 2017). Naive CD4+ T cells were activated for 3 hours using plate-bound CD3e and CD28 antibodies. T cell activation is linked to strong transcriptional shifts and the up-regulation of lineage specific marker genes, such as Tbx21 and Gata1 TODO ref, ref. To generate this data, the authors did not add cytokines, which are needed for T cell differentiations TODO: ref, meaning that any heterogeneity in the activated cell population does not arise from cells residing in different lineage-specific differentiation states. Prior to differential testing, and as explained above, we need to generate a SingleCellExperiment object that is compatible for processing using BASiCS.
We have performed quality control on the naive and activated CD4+ T cells above when preparing the BASiCS_Data object. Therefore, we can directly select the activated CD4+ T cells from the sce_naive_active object.
ind_active <- sce_naive_active$Stimulus == "Active"
sce_active <- sce_naive_active[, sce_naive_active$Stimulus == "Active"]
Similar to the procedure described above in the single condition example, we will use the newBASiCS_Data function to re-generate the SingleCellExperiment object for the use with BASiCS.
## select the biological genes of the dataset
counts <- counts(sce_active)
## Select the ERCC spike-ins of the dataset
spikes <- assay(altExp(sce_active, "spike-ins"))
spikes_present <- rowSums(spikes) != 0
## Remove spike-ins that are not present from matrix and SCE object
spikes <- spikes[spikes_present, ]
altexp_present <- altExp(sce_active, "spike-ins")[spikes_present, ]
altExp(sce_active, "spike-ins") <- altexp_present
ind_spike <- c(rep(FALSE, nrow(counts)), rep(TRUE, nrow(spikes)))
spike_input <- ERCC_count[rownames(spikes), ]
# Generate the SingleCellExperiment object
data_active <- newBASiCS_Data(
Counts = rbind(counts, spikes),
Tech = ind_spike,
SpikeInfo = spike_input,
BatchInfo = sce_active$Individuals
)
## Subset to common genes with naive
data_active <- data_active[rownames(data_naive)]
We can use this SingleCellExperiment object as an input to BASiCS_MCMC and run the MCMC sampler over 40,000 iterations.
MCMC_active <- BASiCS_MCMC(
Data = data_active,
PrintProgress = FALSE,
N = 40000,
Thin = 20,
Burn = 20000,
Regression = TRUE,
WithSpikes = TRUE
)
This sampler runs for 98 minutes on a 1.4 GHz Intel Core i5 processor with 4GB RAM and produces a BASiCS_Chain data object. The same sampling run completed in 59 minutes on a 3.4 GHz Intel Core i7 processor with 16GB RAM. For convenience, this MCMC chain can be again obtained online at https://git.ecdf.ed.ac.uk/vallejosgroup/basicsworkflow2020/.
download_file(
file = "MCMC_active.rds",
website = "https://git.ecdf.ed.ac.uk/vallejosgroup/",
folder = "basicsworkflow2020/raw/master/MCMCs/",
destfile = "rds/MCMC_active.rds"
)
MCMC_active <- readRDS("rds/MCMC_active.rds")
Similar to the BASiCS_Chain quality checks described above, we will again profile the convergence of the chain using a visual inspection of the trace plots, sample histograms and autocorrelation of individual chains. Furthermore, we will use the BASiCS_DiagPlot function to assess the effective sample size of all chains per parameter class.
# Convergence of mean expression parameters
# Trace plot, sample histogram and autocorrelation
plot(MCMC_active, Param = "mu", Gene = 1)
plot(MCMC_active, Param = "mu", Gene = 1000)
# Effective sample size
BASiCS_DiagPlot(MCMC_active, Param = "mu")
# Histogram of effective sample size for mu, as some genes seem to have low values for mu.
BASiCS_DiagHist(MCMC_active, Param = "mu")
# Convergence of over-dispersion parameters
# Trace plot, sample histogram and autocorrelation
plot(MCMC_active, Param = "delta", Gene = 100)
plot(MCMC_active, Param = "delta", Gene = 5000)
# Effective sample size
BASiCS_DiagPlot(MCMC_active, Param = "delta")
# Convergence of residual over-dispersion parameters
# Trace plot, sample histogram and autocorrelation
plot(MCMC_active, Param = "epsilon", Gene = 200)
plot(MCMC_active, Param = "epsilon", Gene = 5000)
# Effective sample size
BASiCS_DiagPlot(MCMC_active, Param = "epsilon")
# Convergence of mRNA capture efficiency parameters
# Trace plot, sample histogram and autocorrelation
plot(MCMC_active, Param = "s", Cell = 10)
plot(MCMC_active, Param = "s", Cell = 50)
# Effective sample size
BASiCS_DiagPlot(MCMC_active, Param = "s")
To highlight the regression trend, we can use the BASiCS_ShowFit function.
BASiCS_ShowFit(MCMC_active)
The MCMC sampler converged and the regression captured the full range of data points similar to the regression done on naive CD4+ T cells. We can therefore move on to perform differential testing between naive and activated CD4+ T cells.
BASiCS_DiagHist(MCMC_active, Param = "mu")
BASiCS_DiagHist(MCMC_active, Param = "delta")
BASiCS_DiagPlot(MCMC_active, Param = "mu")
BASiCS_DiagPlot(MCMC_active, Param = "delta")
To perform robust differential mean expression testing, BASiCS removes genes with small effective sample size from EFDR calibration and differential expression testing. As explained above, this can be done using functions from the CRANpkg(coda) package.
In total, we exlude 238 genes from testing due to low effective sample size when sampling the posterior distribution.
The default settings for differential mean expression testing are as follows:
EpsilonM: Log2 fold change (LFC) threshold for changes in mean expression: \(\log_2(1.5)\approx0.41\) EFDR_M: Expected false discovery rate: 10% Plot, PlotOffset: Boolean to control if results are plotted: TRUE
# Perform differential testing
Test_DE <- BASiCS_TestDE(
Chain1 = MCMC_naive,
Chain2 = MCMC_active,
EpsilonM = log2(1.5),
GroupLabel1 = "Naive",
GroupLabel2 = "Active",
Plot = FALSE,
PlotOffset = FALSE,
CheckESS = TRUE,
MinESS = 100
)
After running the test, we can now visaulize the results in form of a MA-plot (log ratio versus mean average) and volcano plot (posterior probability versus log ratio).
colour_scale <- scale_color_manual(
values = c(
"Naive+" = "dark blue",
"Active+" = "dark red",
"NoDiff" = "black",
"ExcludedLowESS" = "grey60",
"ExcludedByUser" = "grey80"
)
)
# Visualize MA plot
ggplot(Test_DE$TableMean) +
geom_point(aes(log(MeanOverall), MeanLog2FC, colour = ResultDiffMean)) +
colour_scale
# Visualize MA plot
ggplot(Test_DE$TableMean) +
geom_point(aes(MeanLog2FC, ProbDiffMean, colour = ResultDiffMean)) +
colour_scale
As we can see for the comparison of naive and activated CD4+ T cells, most genes show strong differences in mean expression. In such cases, it can be beneficial to increase the LFC threshold or to decrease the threshold for the EFDR. Here, we therfore set the LFC threshold to \(\log_2(2)=1\) to detect genes with strong changes in mean expression. We also set MinESS to 100. This causes genes with effective sample size less than 100 in either input chain to be excluded from EFDR calibration and differential expression testing. In this case, this results in 249 genes with low effective sample size being excluded.
# Perform differential testing
Test_DE <- BASiCS_TestDE(
Chain1 = MCMC_naive,
Chain2 = MCMC_active,
EpsilonM = log2(2),
GroupLabel1 = "Naive",
GroupLabel2 = "Active",
Plot = FALSE,
PlotOffset = FALSE,
CheckESS = TRUE,
MinESS = 100
)
table(Test_DE$TableMean$ResultDiffMean)
##
## Active+ Naive+ NoDiff
## 1149 1146 6786
To understand the regulatory programmes that underlie T cell activation, a variety of downstream analyses can be peformed using the differentially expressed genes. Here, we highlight the use of gene ontology (GO) analysis to group genes that are differentially expressed between naive and activated CD4+ T cells. For this, we use the Bioconductor Biocpkg(goseq) package and the Biocpkg(org.Mm.eg.db) annotation package to test GO enrichment within genes specifically expressed by naive or activated CD4+ T cells
# Collect significan genes as 1 and all other as 0
naive_genes <- as.integer(Test_DE$TableMean$ResultDiffMean == "Naive+")
names(naive_genes) <- Test_DE$TableMean$GeneName
# Build a Null distribution by correcting the gene length bias
pwf <- nullp(naive_genes, "mm10", "ensGene",
bias.data = genenames[names(naive_genes), "gene_length"]
)
GO_wall <- goseq(pwf, "mm10", "ensGene")
ind_signif <- p.adjust(GO_wall$over_represented_pvalue, method = "fdr") < 0.01
naive_GO <- DataFrame(GO_wall[ind_signif, ])
# Add genenames to the GO categories
all_genes <- vector(length = nrow(naive_GO))
for (i in 1:nrow(naive_GO)) {
allegs <- get(naive_GO$category[i], org.Mm.egGO2ALLEGS)
genes <- unique(unlist(mget(allegs, org.Mm.egENSEMBL)))
genes <- as.character(intersect(
genes,
Test_DE$TableMean$GeneName[naive_genes == 1]
))
all_genes[i] <- paste(genenames[genes, "external_gene_name"],
collapse = ", "
)
}
naive_GO$gene <- all_genes
# Show GO categories
naive_GO$term
## [1] "response to stimulus"
## [2] "signaling"
## [3] "cell communication"
## [4] "immune system process"
## [5] "signal transduction"
## [6] "regulation of response to stimulus"
## [7] "intracellular signal transduction"
## [8] "cellular response to stimulus"
## [9] "response to external stimulus"
## [10] "regulation of biological process"
## [11] "biological regulation"
## [12] "cell periphery"
## [13] "plasma membrane part"
## [14] "plasma membrane"
## [15] "T cell activation"
## [16] "regulation of localization"
## [17] "negative regulation of biological process"
## [18] "cell activation"
## [19] "defense response"
## [20] "negative regulation of response to stimulus"
## [21] "leukocyte activation"
## [22] "regulation of signal transduction"
## [23] "regulation of signaling"
## [24] "regulation of cell communication"
## [25] "regulation of cell adhesion"
## [26] "cell adhesion"
## [27] "cytokine production"
## [28] "immune response"
## [29] "biological adhesion"
## [30] "lymphocyte activation"
## [31] "adaptive immune response"
## [32] "regulation of cellular component movement"
## [33] "cell surface receptor signaling pathway"
## [34] "negative regulation of cellular process"
## [35] "regulation of cellular process"
## [36] "protein binding"
## [37] "regulation of multicellular organismal process"
## [38] "negative regulation of multicellular organismal process"
## [39] "regulation of immune system process"
## [40] "cellular developmental process"
## [41] "leukocyte cell-cell adhesion"
## [42] "regulation of intracellular signal transduction"
## [43] "G protein-coupled receptor signaling pathway, coupled to cyclic nucleotide second messenger"
## [44] "regulation of cytokine production"
## [45] "positive regulation of response to stimulus"
## [46] "cell migration"
## [47] "cell motility"
## [48] "localization of cell"
## [49] "adenylate cyclase-modulating G protein-coupled receptor signaling pathway"
## [50] "hemopoiesis"
## [51] "locomotion"
## [52] "regulation of cell migration"
## [53] "response to biotic stimulus"
## [54] "response to stress"
## [55] "cAMP-mediated signaling"
## [56] "lipid catabolic process"
## [57] "positive regulation of cell adhesion"
## [58] "regulation of locomotion"
## [59] "regulation of cell motility"
## [60] "hematopoietic or lymphoid organ development"
## [61] "regulation of ion transport"
## [62] "response to chemical"
## [63] "membrane"
## [64] "actin filament-based process"
## [65] "positive regulation of immune system process"
## [66] "response to external biotic stimulus"
## [67] "response to other organism"
## [68] "positive regulation of biological process"
## [69] "leukocyte differentiation"
## [70] "cell-cell adhesion"
## [71] "movement of cell or subcellular component"
## [72] "multicellular organismal process"
## [73] "G protein-coupled receptor signaling pathway"
## [74] "T cell selection"
## [75] "cyclic-nucleotide-mediated signaling"
## [76] "positive regulation of ion transport"
## [77] "regulation of cell-cell adhesion"
## [78] "positive regulation of cellular process"
## [79] "alpha-beta T cell receptor complex"
## [80] "T cell receptor complex"
## [81] "adenylate cyclase-activating G protein-coupled receptor signaling pathway"
## [82] "inflammatory response"
## [83] "cytoskeleton organization"
## [84] "immune system development"
## [85] "lymphocyte differentiation"
## [86] "cell differentiation"
## [87] "immune effector process"
## [88] "regulation of metal ion transport"
## [89] "T cell differentiation"
## [90] "regulation of molecular function"
## [91] "regulation of leukocyte cell-cell adhesion"
## [92] "regulation of defense response"
## [93] "actin cytoskeleton organization"
## [94] "actin filament organization"
## [95] "cellular response to organic substance"
## [96] "cell death"
## [97] "innate immune response"
## [98] "response to organic substance"
## [99] "plasma membrane receptor complex"
## [100] "leukocyte migration"
## [101] "side of membrane"
## [102] "actin binding"
## [103] "enzyme binding"
## [104] "regulation of developmental process"
## [105] "membrane part"
## [106] "cellular response to chemical stimulus"
## [107] "developmental process"
## [108] "regulation of cell death"
## [109] "positive regulation of cell-cell adhesion"
## [110] "positive regulation of developmental process"
## [111] "calcium ion transport"
## [112] "metal ion transport"
## [113] "regulation of immune response"
## [114] "positive regulation of multicellular organismal process"
## [115] "animal organ development"
## [116] "positive regulation of leukocyte cell-cell adhesion"
## [117] "regulation of transmembrane transport"
## [118] "thymic T cell selection"
## [119] "interferon-gamma production"
## [120] "regulation of interferon-gamma production"
Genes that are down-regulated during immune activation are enriched for cell surface receptor signalling related categories. This is in line with the well-known down-regulation of the T-cell receptor (TCR) complex upon activation (Jose et al. 2000). Here, we find components of the TCR complex (such as Cd28, Cd3e, Cd3g, Cd4; all part of the ‘cell surface receptor signaling pathway’ GO category) to show higher expression in naive cells compared to activated CD4+ T cells.
We can also test the enrichment of GO categories among genes that show increased expression upon T cell activation.
# Collect significan genes as 1 and all other as 0
active_genes <- as.integer(Test_DE$TableMean$ResultDiffMean == "Active+")
names(active_genes) <- Test_DE$TableMean$GeneName
# Build a Null distribution by correcting the gene length bias
pwf <- nullp(active_genes, "mm10", "ensGene",
bias.data = genenames[names(active_genes), "gene_length"]
)
GO_wall <- goseq(pwf, "mm10", "ensGene")
ind_signif <- p.adjust(GO_wall$over_represented_pvalue, method = "fdr") < 0.01
active.GO <- DataFrame(GO_wall[ind_signif, ])
# Add genenames to the GO categories
all_genes <- vector(length = nrow(active.GO))
for (i in 1:nrow(active.GO)) {
allegs <- get(active.GO$category[i], org.Mm.egGO2ALLEGS)
genes <- unique(unlist(mget(allegs, org.Mm.egENSEMBL)))
genes <- as.character(intersect(
genes,
Test_DE$TableMean$GeneName[active_genes == 1]
))
all_genes[i] <- paste(genenames[genes, "external_gene_name"],
collapse = ", "
)
}
active.GO$gene <- all_genes
# Show GO categories
head(active.GO$term)
## [1] "ribonucleoprotein complex biogenesis"
## [2] "ribosome biogenesis"
## [3] "nucleolus"
## [4] "ncRNA metabolic process"
## [5] "rRNA processing"
## [6] "rRNA metabolic process"
Genes that are up-regulated upon activation are dominated by GO categories that are associated to translational and metabolic processes. It has been shown that within the first 24 hours of T cell activation, cells start to up-regulate components involved in translation, ribosome biogenesis and profileration (Tan et al. 2017,Araki et al. (2017)). This signature can be recovered by performing differential mean expression analysis between naive and activated CD4+ T cells.
While other computational tools exists to perform differential mean expression analysis, we next want to highlight the use of BASiCS for differential variability testing.
Due to the negative association between the over-dispersion and mean expression parameters, only genes that do not show a change in mean expression. To avoid any mean expression confounding effect, we perform the differential testing by setting the LFC threshold on mean expression to EpsilonM = 0 while using the default LFC threshold on changes in over-dispersion: EpsilonD = log2(1.5). We furthermore, it is crucial to exclude lowly expressed genes from this analysis to avoid biases arising from non-informative genes.
# Select genes that show expression in both conditions
genes_select <- (Test_DE$TableMean$Mean1 > 1 & Test_DE$TableMean$Mean2 > 1)
Test_DE_LFC0 <- BASiCS_TestDE(
Chain1 = MCMC_naive,
Chain2 = MCMC_active,
EpsilonM = 0,
GroupLabel1 = "Naive",
GroupLabel2 = "Active",
Plot = FALSE,
PlotOffset = FALSE,
CheckESS = TRUE,
MinESS = 100,
GenesSelect = genes_select
)
We first select the genes that remain similarly expressed between both conditions and highligh the differential over-dispersion results in form of MA- and boxplots.
# Select genes with no changes in mean expression
ind_nochange <- Test_DE_LFC0$TableMean$ResultDiffMean == "NoDiff"
# MA plot
ggplot(Test_DE_LFC0$TableDisp[ind_nochange, ]) +
geom_point(aes(log(MeanOverall), DispLog2FC,
colour = ResultDiffDisp
)) +
scale_color_manual(values = c(
"Naive+" = "dark blue",
"Active+" = "dark red",
"NoDiff" = "black"
))
# Boxplot
wilcox.test(Test_DE_LFC0$TableDisp$DispLog2FC[ind_nochange])
##
## Wilcoxon signed rank test with continuity correction
##
## data: Test_DE_LFC0$TableDisp$DispLog2FC[ind_nochange]
## V = 2235806, p-value < 2.2e-16
## alternative hypothesis: true location is not equal to 0
boxplot(Test_DE_LFC0$TableDisp$DispLog2FC[ind_nochange],
ylab = "LFC in over-dispersion", outline = FALSE
)
abline(a = 0, b = 0, lwd = 2, col = "dark red")
wilcox.test(Test_DE_LFC0$TableDisp$DispLog2FC[ind_nochange])
##
## Wilcoxon signed rank test with continuity correction
##
## data: Test_DE_LFC0$TableDisp$DispLog2FC[ind_nochange]
## V = 2235806, p-value < 2.2e-16
## alternative hypothesis: true location is not equal to 0
With this analysis, we detect increased over-dispersion in naive CD4+ T cells for genes that show similar expression levels between naive and activated CD4+ T cells.
While the analysis in the previous section is well suited to detect global changes in variability (e.g. detecting if one cell population overall displays higher expression variabilty), it does not allow the testing of changes in mean expression and expression variability in parallel. For this, BASiCS compares the residual over-dispersion parameters, which do not scale with mean expression, between the two conditions. Here, we filter on genes that are lowly expressed in both conditions and, as explained above, remove genes for with the MCMC sampler leads to a low effective sample size:
# Remove lowly expressed genes
low_expr <- !(Test_DE$TableMean$Mean1 < 1 & Test_DE$TableMean$Mean2 < 1)
# Genes to exclude
genes_select <- low_expr
We can now perform the differential testing as shown above. Again, we use a LFC threshold higher than the default to capture strong changes in mean expression.
# Perform differential testing
Test_DE <- BASiCS_TestDE(
Chain1 = MCMC_naive,
Chain2 = MCMC_active,
EpsilonM = log2(2),
GroupLabel1 = "Naive",
GroupLabel2 = "Active",
CheckESS = TRUE,
MinESS = 100,
Plot = FALSE,
PlotOffset = FALSE,
GenesSelect = genes_select
)
We can now visualize the changes in residual over-dispersion between naive and activated CD4+ T cells in form of a MA-plot. In this visualization, the difference between the posterior medians of the residual over-dispersion parameters \(\epsilon\) is shown on the y-axis. Epsilon values for genes that are not expressed in at least 2 cells per conditions are marked as NA and are therefore not being displayed.
ggplot(Test_DE$TableResDisp) +
geom_point(
aes(
x = log(MeanOverall),
y = ResDispDistance,
colour = ResultDiffResDisp
)
) +
scale_color_manual(
values = c(
"Naive+" = "dark blue",
"Active+" = "dark red",
"NoDiff" = "black",
"ExcludedByUser" = "grey",
"ExcludedFromTesting" = "grey50"
)
)
While one could perform GO analysis (as explained above) on the genesets that change in residual over-dispersion, here, we want to highlight how to analyse changes in mean expression in parallel to changes in variability. For this, we will first combine the results of the differential mean expression and the differential residual over-dispersion test. We will further remove the genes that were excluded from the test and those that are not expressed in at least 2 cells in either condition.
res.df <- cbind(Test_DE$TableMean, Test_DE$TableResDisp[, -c(1, 2)])
ind_exclude <- res.df$ResultDiffResDisp == "ExcludedByUser" |
res.df$ResultDiffResDisp == "ExcludedFromTesting"
res.df <- res.df[!ind_exclude, ]
Next, we can visualize the regulation of each individual gene based on its changes in mean expression and expression variability.
ggplot(res.df) +
geom_point(aes(MeanLog2FC, ResDispDistance,
colour = interaction(ResultDiffMean, ResultDiffResDisp)
))
While we can now test each of the nine genesets for functional enrichment, here, we are particularly interested in the set of genes that are up-regulated with increased variability upon immune activation. Recently, it has been shown that specifically cytokines show heterogeneous expression in active immune cells and diverge between species (Hagai et al. 2018). We are therefore interested which immune response genes show variable activation patterns in early CD4+ T cell activation. For this, we will first collect all immune response genes (GO category: GO:0006955) using the goseq package.
# Select immune response genes
immun.allegs <- get("GO:0006955", org.Mm.egGO2ALLEGS)
immun.genes <- unique(unlist(mget(immun.allegs, org.Mm.egENSEMBL)))
immun.names <- genenames[immun.genes, "external_gene_name"]
Finally, we will find the subset of immune response genes that are variably up-regulated upon early immune activation
# Variable response genes
ind_var.response <- res.df$ResultDiffMean == "Active+" &
res.df$ResultDiffResDisp == "Active+"
var.response.genes <- genenames[
res.df$GeneName[ind_var.response],
"external_gene_name"
]
intersect(var.response.genes, immun.names)
## [1] "Tnf" "Itm2a" "Smad3" "Adam17" "Tnip2"
ind_var.response <- res.df$ResultDiffMean == "Active+" &
res.df$ResultDiffResDisp == "Naive+"
var.response.genes <- genenames[
res.df$GeneName[ind_var.response],
"external_gene_name"
]
intersect(var.response.genes, immun.names)
## [1] "Bmi1" "Tnfsf8"
Pou2f2 and Smad3 .
While immune activation induces large transcriptional shifts in CD4+ T cells, we will now exemplify the use of BASiCS on a system that shows more subtle transcriptional changes during differentiation.
With the development of droplet-based scRNA-Seq (Klein et al. 2015, Macosko et al. (2015)) lead to a strong increase in the number of cells that can be profiled per experiment. With this, large-scale scRNA-Seq datasets have been generated to study development across multiple time-points and capturing musltiple tissues (Ibarra-Soria et al. 2018, Kernfeld et al. (2018)). Here we describe the computational analysis of changes in mean expression and transcriptional variability when data is sparse and technical spike-in genes are missing. For this, we compare cells of the presomitic mesoderm and somitic mesoderm using droplet-based scRNA-Seq data (Ibarra-Soria et al. 2018).
The full dataset is stored under the accession number E-MTAB-6153 on ArrayExpress and can be obtained via:
if (!file.exists("downloads/rawCounts.tsv")) {
website <- "https://www.ebi.ac.uk/"
folder <- "arrayexpress/files/E-MTAB-6153/"
file <- "E-MTAB-6153.processed.3.zip"
download.file(
paste0(website, folder, file),
destfile = "downloads/rawCounts.zip"
)
unzip(zipfile = "downloads/rawCounts.zip", exdir = "downloads")
unlink("downloads/rawCounts.zip")
}
rawCounts <- fread("downloads/rawCounts.tsv",
sep = "\t", header = FALSE,
data.table = FALSE
)
# Genenames are stored in first column
rownames(rawCounts) <- rawCounts[, 1]
rawCounts <- rawCounts[, -1]
Of note: The file is 65 MB in size while the unzipped, raw counts measure 873 MB in size.
The cluster labels of the original publication ca be obtained via:
if (!file.exists("downloads/cellAnnotation.tsv")) {
website <- "https://www.ebi.ac.uk/"
folder <- "arrayexpress/files/E-MTAB-6153/"
file <- "E-MTAB-6153.processed.3.zip"
download.file(
paste0(website, folder, file),
destfile = "cluster_labels.zip"
)
unzip(zipfile = "cluster_labels.zip", exdir = "downloads")
}
cluster_labels <- read.table("downloads/cellAnnotation.tsv",
sep = "\t", header = TRUE, stringsAsFactors = FALSE)
We select the somitic and pre-somitic mesoderm cells to perform differential testing. Prior to running the MCMC, we want to control for outlying cells and heterogeneous substructure in both cell populations.
ind_som <- which(cluster_labels$cellType == "presomiticMesoderm" |
cluster_labels$cellType == "somiticMesoderm")
rawCounts <- rawCounts[, ind_som]
cluster_labels <- cluster_labels[ind_som, ]
For pre-processing and visualization purposes, we load the data into a SingleCellExperiment object. The metadata will be stored in the colData slot.
droplet_sce <- SingleCellExperiment(
assays = list(counts = as(as.matrix(rawCounts), "dgCMatrix"))
)
rm(rawCounts)
colData(droplet_sce) <- DataFrame(
cluster_labels,
subCellType = sub("_.*", "", cluster_labels$cell)
)
For further processing steps, we remove lowly expressed genes.
ind_expressed <- Matrix::rowMeans(counts(droplet_sce)) > 0.1
droplet_sce <- droplet_sce[ind_expressed, ]
To visualize possible sub-structure in the data, we normalize both cell populations using the scran package.
droplet_sce <- computeSumFactors(droplet_sce,
clusters = colData(droplet_sce)$subCellType
)
droplet_sce <- logNormCounts(droplet_sce)
Next, we compute a PCA using the scater package.
droplet_sce <- runPCA(droplet_sce)
We can now visualize the different factors stored in the colData slot.
# Cell types identified by clustering
plotReducedDim(droplet_sce, dimred = "PCA", colour_by = "subCellType") +
scale_fill_manual(values = c("coral4", "steelblue", "limegreen"))
The first PC separates the two different cell types while the second PC captures outlying cells. We will remove these outliers and the intermediate cell population from down-stream analysis.
ind_retain <- reducedDims(droplet_sce)$PCA[, 2] > -5 &
colData(droplet_sce)$subCellType != "presomiticMesoderm.b"
droplet_sce <- droplet_sce[, ind_retain]
We now collected the cells that we want to process using BASiCS. For this, we will generate the BASiCS data objects.
Since droplet-based scRNA-Seq data are generated without including technical spike-in genes, BASiCS uses measurement error models to quantify technical variation through replication (Carroll 1998). Here, it is crucial to provide batch information to the BASiCS model. In the case of the somitic and pre-somoitic mesoderm cells, embryos of two mice have been used to generate the data. Cells isolated from the first embryo were split into two batches and processed independently. To capture cell-type extrinsic, biological variation between the two mice, we pool cells from the two batches of the first animal and only considere cells from mouse 1 and mouse 2 as replicates.
# Presomitic mesoderm
ind_presom <- colData(droplet_sce)$cellType == "presomiticMesoderm"
cur_counts <- droplet_sce[, ind_presom]
cur_batch <- round(colData(cur_counts)$sample, digits = 0)
PSM_Data <- newBASiCS_Data(
Counts = as.matrix(counts(cur_counts)),
Tech = rep(FALSE, nrow(droplet_sce)),
SpikeInfo = NULL,
BatchInfo = cur_batch
)
# Somitic mesoderm
ind_som <- colData(droplet_sce)$cellType == "somiticMesoderm"
cur_counts <- droplet_sce[, ind_som]
cur_batch <- round(colData(cur_counts)$sample, digits = 0)
SM_Data <- newBASiCS_Data(
Counts = as.matrix(counts(cur_counts)),
Tech = rep(FALSE, nrow(droplet_sce)),
SpikeInfo = NULL,
BatchInfo = cur_batch
)
We next estimate model parameters by running the MCMC cell-type specifically. Due to the high cell number (1150 for the pre-somitic mesoderm and 739 for the somitic mesoderm), we set the number of iterations to 20000. In this case, we used the regression BASiCS model to additionally estimate residual over-dispersion parameters.
# Presomitic mesoderm cells
PSM_MCMC <- BASiCS_MCMC(
PSM_Data,
N = 20000,
Thin = 10,
Burn = 10000,
Regression = TRUE
)
# Somitic mesoderm cells
SM_MCMC <- BASiCS_MCMC(
SM_Data,
N = 20000,
Thin = 10,
Burn = 10000,
Regression = TRUE
)
Running these MCMC will take around 8-12 hours on a standard PC (2.6 GHz i5, 8 GB RAM, using 1 core). Here, we provide these chains to download from:
website <- "https://git.ecdf.ed.ac.uk/"
folder <- "vallejosgroup/basicsworkflow2020/raw/master/MCMCs/"
PSM_MCMC <- readRDS(
file = url(
paste0(website, folder, "PSM_MCMC.rds")
)
)
SM_MCMC <- readRDS(
file = url(
paste0(website, folder, "SM_MCMC.rds")
)
)
Next, we visualize the results of the MCMC sampler by visualizing the different chains and by plotting the regression trend. To assess whether the chains converged, we will visualize trace plots for some of the parameters. The mcmc function of the coda package is suited to generate trace plots for each chain.
plot(mcmc(displayChainBASiCS(PSM_MCMC, Param = "mu")[, 120]), main = "mu[120]")
plot(mcmc(displayChainBASiCS(PSM_MCMC, Param = "delta")[, 10]), main = "delta[10]")
plot(mcmc(displayChainBASiCS(PSM_MCMC, Param = "epsilon")[, 20]), main = "epsilon[20]")
plot(mcmc(displayChainBASiCS(PSM_MCMC, Param = "s")[, 90]), main = "s[90]")
plot(mcmc(displayChainBASiCS(PSM_MCMC, Param = "nu")[, 100]), main = "nu[100]")
plot(mcmc(displayChainBASiCS(PSM_MCMC, Param = "beta")[, 2]), main = "beta[2]")
plot(mcmc(displayChainBASiCS(PSM_MCMC, Param = "sigma2")), main = "sigma2")
plot(mcmc(displayChainBASiCS(SM_MCMC, Param = "mu")[, 120]), main = "mu[120]")
plot(mcmc(displayChainBASiCS(SM_MCMC, Param = "delta")[, 10]), main = "delta[10]")
plot(mcmc(displayChainBASiCS(SM_MCMC, Param = "epsilon")[, 20]), main = "epsilon[20]")
plot(mcmc(displayChainBASiCS(SM_MCMC, Param = "s")[, 90]), main = "s[90]")
plot(mcmc(displayChainBASiCS(SM_MCMC, Param = "nu")[, 100]), main = "nu[100]")
plot(mcmc(displayChainBASiCS(SM_MCMC, Param = "beta")[, 2]), main = "beta[2]")
plot(mcmc(displayChainBASiCS(SM_MCMC, Param = "sigma2")), main = "sigma2")
We observe that the chains for all chosen paramteres converged. Furthermore, to validate that the model fitted the mean-variability trend correctly, we plot posterior estimates for over-dispersion paramters \(\delta_i\) against posterior estimates of mean expression parameters \(\mu_i\). For this, the BASiCS_ShowFit function can be used.
BASiCS_ShowFit(PSM_MCMC)
BASiCS_ShowFit(SM_MCMC)
Both trends display similar behaviour which allows us to compare residual over-dispersion estimates.
Next, we test for changes in mean expression and expression variability between the somitic and pre-somitic mesoderm. First, we are interested in assessing global changes in expression variability between the two conditions. For this, over-dispersion paramters \(\delta_i\) for genes that are similarly expressed in both conditions are compared.
droplet_test_logFC0 <- BASiCS_TestDE(
Chain1 = PSM_MCMC,
Chain2 = SM_MCMC,
EpsilonM = 0,
GroupLabel1 = "PSM",
GroupLabel2 = "SM",
Plot = FALSE,
PlotOffset = FALSE,
CheckESS = TRUE,
MinESS = 100
)
not_excluded <- droplet_test_logFC0$TableDisp$ResultDiffDisp != "ExcludedFromTesting"
for_plot <- droplet_test_logFC0$TableDisp[not_excluded, c("Disp1", "Disp2")]
for_plot <- melt(for_plot)
ggplot(for_plot) + geom_boxplot(aes(variable, log(value))) +
scale_x_discrete(labels = c("PSM", "SM")) + theme_minimal(base_size = 15) +
ylab("log(delta)") + xlab("")
With this analysis, we do not detect global changes in expression variability. We next profile changes in mean expression and expression variability on a gene-specific level. For this, we use a log2 fold change threshold of 1 for mean expression testing and the default threshold of \(\psi_0\approx0.41\) for differential variability testing.
droplet_test <- BASiCS_TestDE(
Chain1 = PSM_MCMC,
Chain2 = SM_MCMC,
EpsilonM = 1,
GroupLabel1 = "PSM",
GroupLabel2 = "SM",
Plot = FALSE,
PlotOffset = FALSE,
CheckESS = TRUE,
MinESS = 100
)
# Differential expression
ggplot(droplet_test$TableMean) +
geom_point(aes(log(MeanOverall), MeanLog2FC, colour = ResultDiffMean)) +
theme_minimal(base_size = 15) +
scale_colour_manual(
name = "Differential expression",
values = c("tan", "grey80", "#2166ac", "#b2182b")
) +
ylab(expression(mu[PSM] / mu[SM])) + xlab(expression(log(mu)))
# Differential variability
ggplot(droplet_test$TableResDisp) +
geom_point(aes(log(MeanOverall), ResDispDistance, colour = ResultDiffResDisp)) +
theme_minimal(base_size = 15) +
scale_colour_manual(
name = "Differential variability",
values = c("tan", "grey80", "#542788", "#b35806")
) +
ylab(expression(epsilon[PSM] - epsilon[SM])) + xlab(expression(log(mu)))
We can now list the genes that were detected as differenitally expressed and differentially variable ordered by their difference in mean expression/variability. We first focus on genes that are differentially expressed between the two cell types.
# Highly expressed in somitic mesoderm
ind_sm <- droplet_test$TableMean$ResultDiffMean == "SM+"
SM_mean <- droplet_test$TableMean[ind_sm, ]
SM_mean <- SM_mean[order(SM_mean$MeanLog2FC, decreasing = FALSE), ]
SM_mean$Symbol <- genenames[SM_mean$GeneName, 2]
# Highly expressed in pre-somitic mesoderm
ind_psm <- droplet_test$TableMean$ResultDiffMean == "PSM+"
PSM_mean <- droplet_test$TableMean[ind_psm, ]
PSM_mean <- PSM_mean[order(PSM_mean$MeanLog2FC, decreasing = TRUE), ]
PSM_mean$Symbol <- genenames[PSM_mean$GeneName, 2]
We can next perform GO analysis on up- or down-regulated genes. First, we will perform GO analysis on somitic mesoderm specific genes.
# Collect significan genes as 1 and all other as 0
SM_genes <- as.integer(droplet_test$TableMean$ResultDiffMean == "SM+")
names(SM_genes) <- droplet_test$TableMean$GeneName
# Build a Null distribution by correcting the gene length bias
pwf <- nullp(SM_genes, "mm10", "ensGene", bias.data = genelength[names(SM_genes)])
GO_wall <- goseq(pwf, "mm10", "ensGene")
ind_signif <- p.adjust(GO_wall$over_represented_pvalue, method = "fdr") < 0.01
SM_GO <- DataFrame(GO_wall[ind_signif, ])
# Add genenames to the GO categories
all_genes <- vector(length = nrow(SM_GO))
for (j in 1:nrow(SM_GO)) {
allegs <- get(SM_GO$category[j], org.Mm.eg.db::org.Mm.egGO2ALLEGS)
genes <- unique(unlist(mget(allegs, org.Mm.eg.db::org.Mm.egENSEMBL)))
genes <- as.character(intersect(genes, SM_mean$GeneName))
all_genes[j] <- paste(genes, collapse = ", ")
}
SM_GO$Genes <- all_genes
Now, we perform GO analysis on pre-somitic mesoderm specific genes
# Collect significan genes as 1 and all other as 0
PSM_genes <- as.integer(droplet_test$TableMean$ResultDiffMean == "PSM+")
names(PSM_genes) <- droplet_test$TableMean$GeneName
# Build a Null distribution by correcting the gene length bias
pwf <- nullp(
PSM_genes,
"mm10",
"ensGene",
bias.data = genelength[names(PSM_genes)]
)
GO_wall <- goseq(pwf, "mm10", "ensGene")
ind_signif <- p.adjust(GO_wall$over_represented_pvalue, method = "fdr") < 0.01
PSM_GO <- DataFrame(GO_wall[ind_signif, ])
# Add genenames to the GO categories
all_genes <- vector(length = nrow(PSM_GO))
for (j in 1:nrow(PSM_GO)) {
allegs <- get(PSM_GO$category[j], org.Mm.eg.db::org.Mm.egGO2ALLEGS)
genes <- unique(unlist(mget(allegs, org.Mm.eg.db::org.Mm.egENSEMBL)))
genes <- as.character(intersect(genes, PSM_mean$GeneName))
all_genes[j] <- paste(genes, collapse = ", ")
}
PSM_GO$Genes <- all_genes
To visualize the expression of individual genes, we can use the scater package.
# Expression of Fgf8 in both conditions
ind_fgf <- genenames$external_gene_name == "Fgf8"
plotExpression(droplet_sce,
features = genenames[ind_fgf, 1],
x = "cellType", colour_by = "cellType"
)
plotReducedDim(droplet_sce,
dimred = "PCA",
colour_by = genenames[ind_fgf, 1]
)
Visualize one category in form of heatmap.
genes <- unlist(strsplit(PSM_GO[1, "Genes"], ", "))
for_heatmap <- logcounts(droplet_sce)[genes, ]
colnames(for_heatmap) <- colData(droplet_sce)$cell
# Order cells by cell type
for_heatmap <- for_heatmap[, order(colnames(for_heatmap))]
# Order rows by log2FC
heatmap_ind <- match(rownames(for_heatmap), PSM_mean$GeneName)
heatmap_order <- order(PSM_mean[heatmap_ind, "MeanLog2FC"], decreasing = TRUE)
for_heatmap <- for_heatmap[heatmap_order, ]
pheatmap(for_heatmap,
cluster_cols = FALSE, show_colnames = FALSE, cluster_rows = FALSE,
color = colorRampPalette(c("#053061", "#4393c3", "white", "#d6604d", "#67001f"))(100),
labels_row = genenames[rownames(for_heatmap), 2],
cellheight = 8, fontsize = 7, annotation_col = data.frame(
row.names = colnames(for_heatmap),
cellType = sub("_.*", "", colnames(for_heatmap))
), scale = "row"
)
Next, we are interested in genes that change in variability between the two cell types. To help interpretion of the result, we will split the genes into four categories. These include: * More variable in SM, highly expressed in SM * More variable in SM, lowly expressed in SM * More variable in PSM, highly expressed in PSM * More variable in PSM, lowly expressed in SM
gene_groups <- data.frame(
Genename = droplet_test$TableResDisp$GeneName,
Symbol = genenames[droplet_test$TableResDisp$GeneName, 2],
MeanLog2FC = droplet_test$TableMean$MeanLog2FC,
ResDispDistance = droplet_test$TableResDisp$ResDispDistance,
Regulation = paste(
droplet_test$TableMean$ResultDiffMean,
droplet_test$TableResDisp$ResultDiffResDisp,
sep = "_"
)
)
ind_reg <- gene_groups$Regulation %in% c("SM+_SM+", "SM+_PSM+", "PSM+_PSM+", "PSM+_SM+")
gene_groups <- gene_groups[ind_reg, ]
gene_groups[gene_groups$Regulation == "SM+_SM+", ]
## Genename Symbol MeanLog2FC ResDispDistance Regulation
## 20 ENSMUSG00000025921 Rdh10 -2.949 -1.300 SM+_SM+
## 130 ENSMUSG00000026185 Igfbp5 -3.572 -2.587 SM+_SM+
## 239 ENSMUSG00000009772 Nuak2 -2.631 -2.516 SM+_SM+
## 258 ENSMUSG00000009418 Nav1 -1.326 -1.792 SM+_SM+
## 314 ENSMUSG00000026586 Prrx1 -4.074 -2.004 SM+_SM+
## 364 ENSMUSG00000026544 Dusp23 -2.307 -1.552 SM+_SM+
## 707 ENSMUSG00000027173 Depdc7 -4.176 -1.985 SM+_SM+
## 843 ENSMUSG00000051379 Flrt3 -2.029 -1.164 SM+_SM+
## 927 ENSMUSG00000067786 Nnat -1.224 -0.983 SM+_SM+
## 1545 ENSMUSG00000059857 Ntng1 -2.938 -2.258 SM+_SM+
## 1574 ENSMUSG00000028023 Pitx2 -1.184 -1.994 SM+_SM+
## 1677 ENSMUSG00000028289 Epha7 -2.611 -2.685 SM+_SM+
## 1937 ENSMUSG00000050212 Eva1b -2.442 -2.207 SM+_SM+
## 2041 ENSMUSG00000041120 Nbl1 -3.425 -2.654 SM+_SM+
## 2070 ENSMUSG00000019055 Plod1 -1.125 -2.291 SM+_SM+
## 2154 ENSMUSG00000002944 Cd36 -3.484 -1.624 SM+_SM+
## 2282 ENSMUSG00000029178 Klf3 -2.174 -1.811 SM+_SM+
## 2407 ENSMUSG00000054675 Tmem119 -2.437 -1.510 SM+_SM+
## 2571 ENSMUSG00000029718 Pcolce -1.284 -2.672 SM+_SM+
## 2652 ENSMUSG00000032766 Gng11 -1.360 -1.590 SM+_SM+
## 2654 ENSMUSG00000029661 Col1a2 -1.406 -2.256 SM+_SM+
## 2658 ENSMUSG00000032667 Pon2 -1.488 -2.071 SM+_SM+
## 2659 ENSMUSG00000042607 Asb4 -2.689 -2.476 SM+_SM+
## 2738 ENSMUSG00000039419 Cntnap2 -4.377 -2.858 SM+_SM+
## 2761 ENSMUSG00000085412 Halr1 -4.146 -2.782 SM+_SM+
## 2810 ENSMUSG00000073002 Vamp5 -3.511 -2.617 SM+_SM+
## 2811 ENSMUSG00000050732 Vamp8 -2.136 -1.216 SM+_SM+
## 2930 ENSMUSG00000061353 Cxcl12 -1.380 -2.302 SM+_SM+
## 3165 ENSMUSG00000037166 Ppp1r14a -1.066 -1.392 SM+_SM+
## 3181 ENSMUSG00000006311 Etv2 -1.299 -3.290 SM+_SM+
## 3340 ENSMUSG00000039405 Prss23 -3.337 -3.102 SM+_SM+
## 3578 ENSMUSG00000025511 Tspan4 -1.309 -1.862 SM+_SM+
## 3663 ENSMUSG00000069662 Marcks -1.059 -0.769 SM+_SM+
## 3789 ENSMUSG00000004665 Cnn2 -1.657 -1.028 SM+_SM+
## 3898 ENSMUSG00000036602 Alx1 -2.947 -2.635 SM+_SM+
## 4070 ENSMUSG00000039620 Trmt9b -1.836 -1.506 SM+_SM+
## 4095 ENSMUSG00000031520 Vegfc -1.504 -1.543 SM+_SM+
## 4241 ENSMUSG00000031734 Irx3 -2.732 -2.032 SM+_SM+
## 4243 ENSMUSG00000031737 Irx5 -2.630 -2.212 SM+_SM+
## 4264 ENSMUSG00000031673 Cdh11 -2.248 -2.452 SM+_SM+
## 4396 ENSMUSG00000025810 Nrp1 -3.474 -1.365 SM+_SM+
## 4462 ENSMUSG00000021892 Sh3bp5 -2.630 -2.523 SM+_SM+
## 4476 ENSMUSG00000021792 Prxl2a -1.088 -2.154 SM+_SM+
## 4586 ENSMUSG00000022037 Clu -1.844 -3.881 SM+_SM+
## 4721 ENSMUSG00000035919 Bbs9 -2.541 -1.449 SM+_SM+
## 4889 ENSMUSG00000032231 Anxa2 -2.181 -1.616 SM+_SM+
## 4957 ENSMUSG00000046997 Spsb4 -1.985 -1.467 SM+_SM+
## 5083 ENSMUSG00000054871 Tmem158 -1.173 -2.216 SM+_SM+
## 5208 ENSMUSG00000057098 Ebf1 -2.113 -2.247 SM+_SM+
## 5372 ENSMUSG00000060600 Eno3 -2.537 -1.489 SM+_SM+
## 5398 ENSMUSG00000000753 Serpinf1 -2.493 -2.211 SM+_SM+
## 5502 ENSMUSG00000069763 Tmem100 -2.166 -3.567 SM+_SM+
## 5663 ENSMUSG00000000567 Sox9 -1.729 -2.593 SM+_SM+
## 5767 ENSMUSG00000021215 Net1 -1.502 -2.360 SM+_SM+
## 5864 ENSMUSG00000021453 Gadd45g -1.349 -0.896 SM+_SM+
## 5908 ENSMUSG00000052957 Gas1 -1.301 -1.295 SM+_SM+
## 5914 ENSMUSG00000033114 Slc35d2 -1.538 -2.662 SM+_SM+
## 5942 ENSMUSG00000021591 Glrx -2.298 -2.549 SM+_SM+
## 5947 ENSMUSG00000069171 Nr2f1 -2.990 -3.441 SM+_SM+
## 5956 ENSMUSG00000021613 Hapln1 -3.705 -2.481 SM+_SM+
## 5986 ENSMUSG00000078302 Foxd1 -4.275 -1.705 SM+_SM+
## 6016 ENSMUSG00000021701 Plk2 -1.676 -2.312 SM+_SM+
## 6105 ENSMUSG00000036144 Meox2 -3.769 -2.930 SM+_SM+
## 6198 ENSMUSG00000021127 Zfp36l1 -1.104 -1.943 SM+_SM+
## 6241 ENSMUSG00000047414 Flrt2 -3.981 -1.667 SM+_SM+
## 6795 ENSMUSG00000048490 Nrip1 -1.233 -1.409 SM+_SM+
## 6869 ENSMUSG00000048826 Dact2 -3.919 -1.800 SM+_SM+
## 7242 ENSMUSG00000005836 Gata6 -3.345 -3.975 SM+_SM+
## 7354 ENSMUSG00000050875 Minar2 -3.444 -2.837 SM+_SM+
## 7530 ENSMUSG00000024663 Rab3il1 -2.653 -2.179 SM+_SM+
## 7556 ENSMUSG00000024713 Pcsk5 -1.147 -2.220 SM+_SM+
## 7573 ENSMUSG00000048138 Dmrt2 -4.854 -4.142 SM+_SM+
## 7618 ENSMUSG00000018822 Sfrp5 -3.713 -3.884 SM+_SM+
## 7665 ENSMUSG00000025026 Add3 -1.294 -1.742 SM+_SM+
We can visualize individual genes using scater.
# Expression of Meox2 in both conditions
ind_meox <- genenames$external_gene_name == "Meox2"
plotExpression(droplet_sce,
features = genenames[ind_meox, 1],
x = "cellType", colour_by = "cellType"
) +
scale_fill_manual(values = c("coral4", "limegreen"))
plotReducedDim(droplet_sce,
dimred = "PCA",
colour_by = genenames[ind_meox, 1]
)
This section is required if the paper does not include novel data or analyses. It allows authors to briefly summarize the key points from the article.
devtools::session_info()
## ─ Session info ───────────────────────────────────────────────────────────────
## setting value
## version R version 3.6.1 (2019-07-05)
## os Ubuntu 18.04.4 LTS
## system x86_64, linux-gnu
## ui unknown
## language en_GB:en
## collate en_GB.UTF-8
## ctype en_GB.UTF-8
## tz Europe/London
## date 2020-02-27
##
## ─ Packages ───────────────────────────────────────────────────────────────────
## package * version date lib
## AnnotationDbi * 1.48.0 2019-10-29 [1]
## AnnotationFilter * 1.10.0 2019-10-29 [1]
## askpass 1.1 2019-01-13 [1]
## assertthat 0.2.1 2019-03-21 [1]
## backports 1.1.5 2019-10-02 [1]
## BASiCS * 1.8.0 2019-10-29 [1]
## beeswarm 0.2.3 2016-04-25 [1]
## BiasedUrn * 1.07 2015-12-28 [1]
## Biobase * 2.46.0 2019-10-29 [1]
## BiocFileCache 1.10.2 2019-11-08 [1]
## BiocGenerics * 0.32.0 2019-10-29 [1]
## BiocManager 1.30.10 2019-11-16 [1]
## BiocNeighbors 1.4.1 2019-11-01 [1]
## BiocParallel * 1.20.0 2019-10-30 [1]
## BiocSingular 1.2.0 2019-10-29 [1]
## BiocStyle * 2.14.0 2019-10-29 [1]
## BiocWorkflowTools 1.13.1 2020-02-27 [1]
## biomaRt * 2.42.0 2019-10-29 [1]
## Biostrings 2.54.0 2019-10-29 [1]
## bit 1.1-15.2 2020-02-10 [1]
## bit64 0.9-7 2017-05-08 [1]
## bitops 1.0-6 2013-08-17 [1]
## blob 1.2.1 2020-01-20 [1]
## bookdown 0.17 2020-01-11 [1]
## callr 3.4.2 2020-02-12 [1]
## cli 2.0.1 2020-01-08 [1]
## coda * 0.19-3 2019-07-05 [1]
## codetools 0.2-16 2018-12-24 [1]
## colorspace 1.4-1 2019-03-18 [1]
## cowplot 1.0.0 2019-07-11 [1]
## crayon 1.3.4 2017-09-16 [1]
## curl 4.3 2019-12-02 [1]
## data.table * 1.12.8 2019-12-09 [1]
## DBI 1.1.0 2019-12-15 [1]
## dbplyr 1.4.2 2019-06-17 [1]
## DelayedArray * 0.12.0 2019-10-29 [1]
## DelayedMatrixStats 1.8.0 2019-10-29 [1]
## desc 1.2.0 2018-05-01 [1]
## devtools 2.2.2 2020-02-17 [1]
## digest 0.6.24 2020-02-12 [1]
## dplyr 0.8.4 2020-01-31 [1]
## dqrng 0.2.1 2019-05-17 [1]
## edgeR 3.28.0 2019-10-29 [1]
## ellipsis 0.3.0 2019-09-20 [1]
## EnsDb.Mmusculus.v79 * 2.99.0 2019-03-28 [1]
## ensembldb * 2.10.0 2019-10-29 [1]
## evaluate 0.14 2019-05-28 [1]
## fansi 0.4.1 2020-01-08 [1]
## farver 2.0.3 2020-01-16 [1]
## fastmap 1.0.1 2019-10-08 [1]
## fs 1.3.1 2019-05-06 [1]
## geneLenDataBase * 1.22.0 2019-11-05 [1]
## GenomeInfoDb * 1.22.0 2019-10-29 [1]
## GenomeInfoDbData 1.2.2 2019-11-14 [1]
## GenomicAlignments 1.22.1 2019-11-12 [1]
## GenomicFeatures * 1.38.0 2019-10-29 [1]
## GenomicRanges * 1.38.0 2019-10-29 [1]
## ggbeeswarm 0.6.0 2017-08-07 [1]
## ggExtra 0.9 2019-08-27 [1]
## ggplot2 * 3.2.1 2019-08-10 [1]
## ggpointdensity * 0.1.0 2019-08-28 [1]
## git2r 0.26.1 2019-06-29 [1]
## glue 1.3.1 2019-03-12 [1]
## GO.db 3.10.0 2019-11-14 [1]
## goseq * 1.38.0 2019-10-29 [1]
## gridExtra 2.3 2017-09-09 [1]
## gtable 0.3.0 2019-03-25 [1]
## hexbin 1.28.1 2020-02-03 [1]
## highr 0.8 2019-03-20 [1]
## hms 0.5.3 2020-01-08 [1]
## htmltools 0.4.0 2019-10-04 [1]
## httpuv 1.5.2 2019-09-11 [1]
## httr 1.4.1 2019-08-05 [1]
## igraph 1.2.4.2 2019-11-27 [1]
## IRanges * 2.20.0 2019-10-29 [1]
## irlba 2.3.3 2019-02-05 [1]
## KernSmooth 2.23-16 2019-10-15 [1]
## knitr * 1.28 2020-02-06 [1]
## labeling 0.3 2014-08-23 [1]
## later 1.0.0 2019-10-04 [1]
## lattice 0.20-40 2020-02-19 [1]
## lazyeval 0.2.2 2019-03-15 [1]
## lifecycle 0.1.0 2019-08-01 [1]
## limma 3.42.0 2019-10-29 [1]
## locfit 1.5-9.1 2013-04-20 [1]
## magick 2.3 2020-01-24 [1]
## magrittr 1.5 2014-11-22 [1]
## MASS 7.3-51.5 2019-12-20 [1]
## Matrix * 1.2-18 2019-11-27 [1]
## matrixStats * 0.55.0 2019-09-07 [1]
## memoise 1.1.0 2017-04-21 [1]
## mgcv 1.8-31 2019-11-09 [1]
## mime 0.9 2020-02-04 [1]
## miniUI 0.1.1.1 2018-05-18 [1]
## munsell 0.5.0 2018-06-12 [1]
## nlme 3.1-144 2020-02-06 [1]
## openssl 1.4.1 2019-07-18 [1]
## org.Mm.eg.db * 3.10.0 2019-10-24 [1]
## pheatmap * 1.0.12 2019-01-04 [1]
## pillar 1.4.3 2019-12-20 [1]
## pkgbuild 1.0.6 2019-10-09 [1]
## pkgconfig 2.0.3 2019-09-22 [1]
## pkgload 1.0.2 2018-10-29 [1]
## plyr 1.8.5 2019-12-10 [1]
## prettyunits 1.1.1 2020-01-24 [1]
## processx 3.4.2 2020-02-09 [1]
## progress 1.2.2 2019-05-16 [1]
## promises 1.1.0 2019-10-04 [1]
## ProtGenerics 1.18.0 2019-10-29 [1]
## ps 1.3.2 2020-02-13 [1]
## purrr 0.3.3 2019-10-18 [1]
## R6 2.4.1 2019-11-12 [1]
## rappdirs 0.3.1 2016-03-28 [1]
## RColorBrewer 1.1-2 2014-12-07 [1]
## Rcpp 1.0.3 2019-11-08 [1]
## RCurl 1.98-1.1 2020-01-19 [1]
## remotes 2.1.1 2020-02-15 [1]
## reshape2 1.4.3 2017-12-11 [1]
## rlang 0.4.4 2020-01-28 [1]
## rmarkdown 2.1 2020-01-20 [1]
## rprojroot 1.3-2 2018-01-03 [1]
## Rsamtools 2.2.1 2019-11-06 [1]
## RSQLite 2.2.0 2020-01-07 [1]
## rstudioapi 0.11 2020-02-07 [1]
## rsvd 1.0.3 2020-02-17 [1]
## rtracklayer 1.46.0 2019-10-29 [1]
## S4Vectors * 0.24.0 2019-10-29 [1]
## scales 1.1.0 2019-11-18 [1]
## scater * 1.14.3 2019-11-07 [1]
## scran * 1.14.3 2019-11-08 [1]
## sessioninfo 1.1.1 2018-11-05 [1]
## shiny 1.4.0 2019-10-10 [1]
## SingleCellExperiment * 1.8.0 2019-10-29 [1]
## statmod 1.4.34 2020-02-17 [1]
## stringi 1.4.6 2020-02-17 [1]
## stringr 1.4.0 2019-02-10 [1]
## SummarizedExperiment * 1.16.0 2019-10-29 [1]
## testthat * 2.3.1 2019-12-01 [1]
## tibble 2.1.3 2019-06-06 [1]
## tidyselect 1.0.0 2020-01-27 [1]
## tinytex 0.19 2020-01-14 [1]
## usethis 1.5.1.9000 2020-02-20 [1]
## vctrs 0.2.2 2020-01-24 [1]
## vipor 0.4.5 2017-03-22 [1]
## viridis * 0.5.1 2018-03-29 [1]
## viridisLite * 0.3.0 2018-02-01 [1]
## withr 2.1.2 2018-03-15 [1]
## xfun 0.12 2020-01-13 [1]
## XML 3.99-0.3 2020-01-20 [1]
## xtable 1.8-4 2019-04-21 [1]
## XVector 0.26.0 2019-10-29 [1]
## yaml 2.2.1 2020-02-01 [1]
## zlibbioc 1.32.0 2019-10-29 [1]
## source
## Bioconductor
## Bioconductor
## CRAN (R 3.6.1)
## CRAN (R 3.6.1)
## CRAN (R 3.6.1)
## Bioconductor
## CRAN (R 3.6.1)
## CRAN (R 3.6.0)
## Bioconductor
## Bioconductor
## Bioconductor
## CRAN (R 3.6.1)
## Bioconductor
## Bioconductor
## Bioconductor
## Bioconductor
## Github (grimbough/BiocWorkflowTools@1fc9933)
## Bioconductor
## Bioconductor
## CRAN (R 3.6.1)
## CRAN (R 3.6.1)
## CRAN (R 3.6.1)
## CRAN (R 3.6.1)
## CRAN (R 3.6.1)
## CRAN (R 3.6.1)
## CRAN (R 3.6.1)
## CRAN (R 3.6.1)
## CRAN (R 3.6.1)
## CRAN (R 3.6.1)
## CRAN (R 3.6.1)
## CRAN (R 3.6.1)
## CRAN (R 3.6.1)
## CRAN (R 3.6.1)
## CRAN (R 3.6.1)
## CRAN (R 3.6.1)
## Bioconductor
## Bioconductor
## CRAN (R 3.6.1)
## CRAN (R 3.6.1)
## CRAN (R 3.6.1)
## CRAN (R 3.6.1)
## CRAN (R 3.6.1)
## Bioconductor
## CRAN (R 3.6.1)
## Bioconductor
## Bioconductor
## CRAN (R 3.6.1)
## CRAN (R 3.6.1)
## CRAN (R 3.6.1)
## CRAN (R 3.6.1)
## CRAN (R 3.6.1)
## Bioconductor
## Bioconductor
## Bioconductor
## Bioconductor
## Bioconductor
## Bioconductor
## CRAN (R 3.6.1)
## CRAN (R 3.6.1)
## CRAN (R 3.6.1)
## CRAN (R 3.6.1)
## CRAN (R 3.6.1)
## CRAN (R 3.6.1)
## Bioconductor
## Bioconductor
## CRAN (R 3.6.1)
## CRAN (R 3.6.1)
## CRAN (R 3.6.1)
## CRAN (R 3.6.1)
## CRAN (R 3.6.1)
## CRAN (R 3.6.1)
## CRAN (R 3.6.1)
## CRAN (R 3.6.1)
## CRAN (R 3.6.1)
## Bioconductor
## CRAN (R 3.6.1)
## CRAN (R 3.6.1)
## CRAN (R 3.6.1)
## CRAN (R 3.6.1)
## CRAN (R 3.6.1)
## CRAN (R 3.6.1)
## CRAN (R 3.6.1)
## CRAN (R 3.6.1)
## Bioconductor
## CRAN (R 3.6.1)
## CRAN (R 3.6.1)
## CRAN (R 3.6.1)
## CRAN (R 3.6.1)
## CRAN (R 3.6.1)
## CRAN (R 3.6.1)
## CRAN (R 3.6.1)
## CRAN (R 3.6.1)
## CRAN (R 3.6.1)
## CRAN (R 3.6.1)
## CRAN (R 3.6.1)
## CRAN (R 3.6.1)
## CRAN (R 3.6.1)
## Bioconductor
## CRAN (R 3.6.0)
## CRAN (R 3.6.1)
## CRAN (R 3.6.1)
## CRAN (R 3.6.1)
## CRAN (R 3.6.1)
## CRAN (R 3.6.1)
## CRAN (R 3.6.1)
## CRAN (R 3.6.1)
## CRAN (R 3.6.1)
## CRAN (R 3.6.1)
## Bioconductor
## CRAN (R 3.6.1)
## CRAN (R 3.6.1)
## CRAN (R 3.6.1)
## CRAN (R 3.6.1)
## CRAN (R 3.6.1)
## CRAN (R 3.6.1)
## CRAN (R 3.6.1)
## CRAN (R 3.6.1)
## CRAN (R 3.6.1)
## CRAN (R 3.6.1)
## CRAN (R 3.6.1)
## CRAN (R 3.6.1)
## Bioconductor
## CRAN (R 3.6.1)
## CRAN (R 3.6.1)
## CRAN (R 3.6.1)
## Bioconductor
## Bioconductor
## CRAN (R 3.6.1)
## Bioconductor
## Bioconductor
## CRAN (R 3.6.1)
## CRAN (R 3.6.1)
## Bioconductor
## CRAN (R 3.6.1)
## CRAN (R 3.6.1)
## CRAN (R 3.6.1)
## Bioconductor
## CRAN (R 3.6.1)
## CRAN (R 3.6.1)
## CRAN (R 3.6.1)
## CRAN (R 3.6.1)
## Github (r-lib/usethis@2a3d134)
## CRAN (R 3.6.1)
## CRAN (R 3.6.1)
## CRAN (R 3.6.1)
## CRAN (R 3.6.1)
## CRAN (R 3.6.1)
## CRAN (R 3.6.1)
## CRAN (R 3.6.1)
## CRAN (R 3.6.1)
## Bioconductor
## CRAN (R 3.6.1)
## Bioconductor
##
## [1] /home/alan/R/x86_64-pc-linux-gnu-library/3.6
## [2] /usr/local/lib/R/site-library
## [3] /usr/lib/R/site-library
## [4] /usr/lib/R/library
TODO: Versions of all main Bioconductor packages I think sessionInfo suffices.
TODO: Links to Grun data and Ximenas data TODO: Links to MCMC chains
TODO: Software: All software used in this workflow is available as part of Bioconductor X.Y TODO: The source code of this workflow is available from: YYY TODO: Link to Github release version, source code TODO: License: ask Aaron
This section will be generated by the Editorial Office before publication. Authors are asked to provide some initial information to assist the Editorial Office, as detailed below.
‘No competing interests were disclosed’.
Please state who funded the work discussed in this article, whether it is your employer, a grant funder etc. Please do not list funding that you have that is not relevant to this specific piece of research. For each funder, please state the funder’s name, the grant number where applicable, and the individual to whom the grant was assigned. If your work was not funded by any grants, please include the line: ‘The author(s) declared that no grants were involved in supporting this work.’
This section should acknowledge anyone who contributed to the research or the article but who does not qualify as an author based on the criteria provided earlier (e.g. someone or an organization that provided writing assistance). Please state how they contributed; authors should obtain permission to acknowledge from all those mentioned in the Acknowledgments section.
Please do not list grant funding in this section.
Anders, Simon, and Wolfgang Huber. 2010. “Differential expression analysis for sequence count data.” Genome Biology 11 (10). BioMed Central Ltd: R106. doi:10.1186/gb-2010-11-10-r106.
Antolović, Vlatka, Agnes Miermont, Adam M. Corrigan, and Jonathan R. Chubb. 2017. “Generation of Single-Cell Transcript Variability by Repression.” Current Biology 27: 1811–7. doi:10.1016/j.cub.2017.05.028.
Araki, Koichi, Masahiro Morita, Annelise G. Bederman, Bogumila T. Konieczny, Haydn T. Kissick, Nahum Sonenberg, and Rafi Ahmed. 2017. “Translation is actively regulated during the differentiation of CD8 + effector T cells.” Nature Immunology 18 (9). Nature Publishing Group: 1046–57. doi:10.1038/ni.3795.
Brennecke, Philip, Simon Anders, Jong Kyoung Kim, Aleksandra A Kołodziejczyk, Xiuwei Zhang, Valentina Proserpio, Bianka Baying, et al. 2013. “Accounting for technical noise in single-cell RNA-seq experiments.” doi:10.1038/nmeth.2645.
Brooks, Stephen P., and Andrew Gelman. 1998. “General Methods for Monitoring Convergence of Iterative Simulations.” Journal of Computational and Graphical Statistics 7 (4). Taylor & Francis: 434–55. doi:10.1080/10618600.1998.10474787.
Carroll, Raymond J. 1998. Measurement Error in Epidemiologic Studies. Chichester, UK: John Wiley & Sons, Ltd. doi:10.1002/9781118445112.stat05178.
Cowles, Mary Kathryn, and Bradley P. Carlin. 1996. “Markov Chain Monte Carlo Convergence Diagnostics: A Comparative Review.” Journal of the American Statistical Association 91 (434). Taylor & Francis: 883–904. doi:10.1080/01621459.1996.10476956.
Delmans, Mihails, and Martin Hemberg. 2016. “Discrete distributional differential expression (D(3)E) - a tool for gene expression analysis of single-cell RNA-seq data.” BMC Bioinformatics 17 (1). BMC Bioinformatics: 110. doi:10.1186/s12859-016-0944-6.
Durinck, Steffen, Yves Moreau, Arek Kasprzyk, Sean Davis, Bart De Moor, Alvis Brazma, and Wolfgang Huber. 2005. “BioMart and Bioconductor: A powerful link between biological databases and microarray data analysis.” Bioinformatics 21 (16): 3439–40. doi:10.1093/bioinformatics/bti525.
Eberwine, James, and Junhyong Kim. 2015. “Cellular Deconstruction: Finding Meaning in Individual Cell Variation.” Trends in Cell Biology 25 (10). Elsevier Ltd: 569–78. doi:10.1016/j.tcb.2015.07.004.
Ecker, Simone, Vera Pancaldi, Alfonso Valencia, Stephan Beck, and Dirk S. Paul. 2017. “Epigenetic and Transcriptional Variability Shape Phenotypic Plasticity.” BioEssays, 1700148. doi:10.1002/bies.201700148.
Eling, Nils, Arianne C. Richard, Sylvia Richardson, John C. Marioni, and Catalina A. Vallejos. 2017. “Robust expression variability testing reveals heterogeneous T cell responses.” bioRxiv, 237214. doi:10.1101/237214.
———. 2018. “Correcting the Mean-Variance Dependency for Differential Variability Testing Using Single-Cell RNA Sequencing Data.” Cell Systems 7 (3). Elsevier Inc.: 1–11. doi:10.1016/j.cels.2018.06.011.
Elowitz, Michael B, Arnold J Levine, Eric D Siggia, and Peter S Swain. 2002. “Stochastic gene expression in a single cell.” Science 297 (5584): 1183–6. doi:10.1126/science.1070919.
External RNA Controls Consortium. 2005. “Proposed methods for testing and selecting the ERCC external RNA controls.” BMC Genomics 6 (June 2004): 150. doi:10.1186/1471-2164-6-150.
Faure, Andre J., Jörn M. Schmiedel, and Ben Lehner. 2017. “Systematic Analysis of the Determinants of Gene Expression Noise in Embryonic Stem Cells.” Cell Systems 5 (5): 471–84. doi:10.1016/j.cels.2017.10.003.
Gelman, Andrew, John Carlin, Hal Stern, David B Dunson, Aki Vehtari, and Donald B Rubin. 2014. Bayesian Data Analysis. CRC Press.
Goolam, Mubeen, Antonio Scialdone, Sarah J L Graham, Iain C. MacAulay, Agnieszka Jedrusik, Anna Hupalowska, Thierry Voet, John C. Marioni, and Magdalena Zernicka-Goetz. 2016. “Heterogeneity in Oct4 and Sox2 Targets Biases Cell Fate in 4-Cell Mouse Embryos.” Cell 165 (1). The Authors: 61–74. doi:10.1016/j.cell.2016.01.047.
Hagai, Tzachi, Xi Chen, Ricardo J Miragaia, Raghd Rostom, Tomás Gomes, Natalia Kunowska, Johan Henriksson, et al. 2018. “Gene expression variability across cells and species shapes innate immunity.” Nature 563: 197–202. doi:10.1038/s41586-018-0657-2.
Ibarra-Soria, Ximena, Wajid Jawaid, Blanca Pijuan-Sala, Vasileios Ladopoulos, Antonio Scialdone, David J. Jörg, Richard C.V. Tyser, et al. 2018. “Defining murine organogenesis at single-cell resolution reveals a role for the leukotriene pathway in regulating blood progenitor formation.” Nature Cell Biology 20 (2): 127–34. doi:10.1038/s41556-017-0013-z.
Ilicic, Tomislav, Jong Kyoung Kim, Aleksandra A Kolodziejczyk, Frederik Otzen Bagger, Davis James Mccarthy, John C Marioni, and Sarah A Teichmann. 2016. “Classification of low quality cells from single-cell RNA-seq data.” Genome Biology 17 (29). Genome Biology: 1–15. doi:10.1186/s13059-016-0888-1.
Iwamoto, Kazunari, Yuki Shindo, and Koichi Takahashi. 2016. “Modeling Cellular Noise Underlying Heterogeneous Cell Responses in the Epidermal Growth Factor Signaling Pathway.” PLoS Computational Biology 12 (11): 1–18. doi:10.1371/journal.pcbi.1005222.
Jose, Ester San, Aldo Borroto, Florence Niedergang, Andres Alcover, and Balbino Alarcon. 2000. “Triggering the TCR complexes causes the downregulation of nonengaged receptors by signal transduction-dependent mechanism.” Immunity 12: 161–70.
Katayama, Shintaro, Virpi Töhönen, Sten Linnarsson, and Juha Kere. 2013. “SAMstrt: Statistical test for differential expression in single-cell transcriptome with spike-in normalization.” Bioinformatics 29 (22): 2943–5. doi:10.1093/bioinformatics/btt511.
Kernfeld, Eric M., Ryan M.J. Genga, Kashfia Neherin, Margaret E. Magaletta, Ping Xu, and René Maehr. 2018. “A Single-Cell Transcriptomic Atlas of Thymus Organogenesis Resolves Cell Types and Developmental Maturation.” Immunity 48: 1–13. doi:10.1016/j.immuni.2018.04.015.
Kharchenko, Peter V, Lev Silberstein, and David T Scadden. 2014. “Bayesian approach to single-cell differential expression analysis.” Nature Methods 11 (7): 740–2. doi:10.1038/nmeth.2967.
Kim, Beomseok, Eunmin Lee, and Jong Kyoung Kim. 2019. “Analysis of Technical and Biological Variabilityin Single-Cell RNA Sequencing.” In Computational Methods for Single-Cell Data Analysis, 1935:25–43. doi:10.1007/978-1-4939-9057-3.
Kiselev, Vladimir Yu, Tallulah S. Andrews, and Martin Hemberg. 2019. “Challenges in unsupervised clustering of single-cell RNA-seq data.” Nature Reviews Genetics 2018. Springer US. doi:10.1038/s41576-018-0088-9.
Kiviet, Daniel J., Philippe Nghe, Noreen Walker, Sarah Boulineau, Vanda Sunderlikova, and Sander J. Tans. 2014. “Stochasticity of metabolism and growth at the single-cell level.” Nature 514 (7522). Nature Publishing Group: 376–79. doi:10.1038/nature13582.
Klein, Allon M., Linas Mazutis, Ilke Akartuna, Naren Tallapragada, Adrian Veres, Victor Li, Leonid Peshkin, David A. Weitz, and Marc W. Kirschner. 2015. “Droplet barcoding for single-cell transcriptomics applied to embryonic stem cells.” Cell 161 (5). Elsevier Inc.: 1187–1201. doi:10.1016/j.cell.2015.04.044.
Kolodziejczyk, Aleksandra A., Jong Kyoung Kim, Jason C.H. Tsang, Tomislav Ilicic, Johan Henriksson, Kedar N. Natarajan, Alex C. Tuck, et al. 2015. “Single cell RNA-sequencing of pluripotent states unlocks modular transcriptional variation.” Cell Stem Cell 17: 471–85. doi:10.1016/j.stem.2015.09.011.
Lönnberg, Tapio, Valentine Svensson, Kylie R. James, Daniel Fernandez-Ruiz, Ismail Sebina, Ruddy Montandon, Megan S. F. Soon, et al. 2017. “Single-cell RNA-seq and computational analysis using temporal mixture modeling resolves Th1/Tfh fate bifurcation in malaria.” Science Immunology 2 (9): eaal2192. doi:10.1126/sciimmunol.aal2192.
Lun, Aaron T. L., Karsten Bach, and John C. Marioni. 2016. “Pooling across cells to normalize single-cell RNA sequencing data with many zero counts.” Genome Biology 17 (1). Genome Biology: 75. doi:10.1186/s13059-016-0947-7.
Lun, Aaron T. L., Davis J. McCarthy, and John C. Marioni. 2016. “A step-by-step workflow for basic analyses of single-cell RNA-seq data.” F1000Research 5 (2122). doi:10.12688/f1000research.9501.1.
Macosko, Evan Z., Anindita Basu, Rahul Satija, James Nemesh, Karthik Shekhar, Melissa Goldman, Itay Tirosh, et al. 2015. “Highly parallel genome-wide expression profiling of individual cells using nanoliter droplets.” Cell 161 (5). Elsevier: 1202–14. doi:10.1016/j.cell.2015.05.002.
Martinez-Jimenez, Celia P., Nils Eling, Hung-Chang Chen, Catalina A Vallejos, Aleksandra A Kolodziejczyk, Frances Connor, Lovorka Stojic, et al. 2017. “Aging increases cell-to-cell transcriptional variability upon immune stimulation.” Science 1436: 1433–6. doi:10.1126/science.aah4115.
McCarthy, Davis J., Kieran R. Campbell, Aaron T.L. Lun, and Quin F. Wills. 2017. “Scater: Pre-processing, quality control, normalization and visualization of single-cell RNA-seq data in R.” Bioinformatics 33 (8): 1179–86. doi:10.1093/bioinformatics/btw777.
Morgan, Michael D., and John C. Marioni. 2018. “CpG island composition differences are a source of gene expression noise indicative of promoter responsiveness.” Genome Biology 19 (1). Genome Biology: 1–13. doi:10.1186/s13059-018-1461-x.
Ohnishi, Yusuke, Wolfgang Huber, Akiko Tsumura, Minjung Kang, Panagiotis Xenopoulos, Kazuki Kurimoto, Andrzej K Oleś, et al. 2014. “Cell-to-cell expression variability followed by signal reinforcement progressively segregates early mouse lineages.” Nature Cell Biology 16 (1). Nature Publishing Group, a division of Macmillan Publishers Limited. All Rights Reserved.: 27–37. doi:10.1038/ncb2881.
Patange, Simona, Michelle Girvan, and Daniel R. Larson. 2018. “Single-cell systems biology: Probing the basic unit of information flow.” Current Opinion in Systems Biology 8. Elsevier Ltd: 7–15. doi:10.1016/j.coisb.2017.11.011.
Pijuan-Sala, Blanca, Jonathan A. Griffiths, Carolina Guibentif, Tom W. Hiscock, Wajid Jawaid, Fernando J. Calero-Nieto, Carla Mulas, et al. 2019. “A single-cell molecular map of mouse gastrulation and early organogenesis.” Nature 566 (7745): 490–95. doi:10.1038/s41586-019-0933-9.
Plummer, Martyn, Nicky Best, Kate Cowles, and Karen Vines. 2006. “CODA: Convergence Diagnosis and Output Analysis for Mcmc.” R News 6 (1): 7–11. http://oro.open.ac.uk/22547/.
Prakadan, Sanjay M., Alex K. Shalek, and David A. Weitz. 2017. “Scaling by shrinking: empowering single-cell ’omics’ with microfluidic devices.” Nature Reviews Genetics 18 (6): 345–61. doi:10.1038/nrg.2017.15.
Robinson, Mark D., Davis J. McCarthy, and Gordon K. Smyth. 2009. “edgeR: A Bioconductor package for differential expression analysis of digital gene expression data.” Bioinformatics 26 (1): 139–40. doi:10.1093/bioinformatics/btp616.
Shalek, Alex K, Rahul Satija, Joe Shuga, John J Trombetta, Dave Gennert, Diana Lu, Peilin Chen, et al. 2014. “Single-cell RNA-seq reveals dynamic paracrine control of cellular variation.” Nature 510 (7505). Nature Publishing Group: 263–69. doi:10.1038/nature13437.
Stegle, Oliver, Sarah a. Teichmann, and John C. Marioni. 2015. “Computational and analytical challenges in single-cell transcriptomics.” Nature Reviews Genetics 16 (3). Nature Publishing Group: 133–45. doi:10.1038/nrg3833.
Tan, Thomas C. J., John Knight, Thomas Sbarrato, Kate Dudek, Anne E. Willis, and Rose Zamoyska. 2017. “Suboptimal T-cell receptor signaling compromises protein translation, ribosome biogenesis, and proliferation of mouse CD8 T cells.” Proceedings of the National Academy of Sciences 114 (30): E6117–E6126. doi:10.1073/pnas.1700939114.
Vallejos, Catalina A, Sylvia Richardson, and John C Marioni. 2016. “Beyond comparisons of means: understanding changes in gene expression at the single-cell level.” Genome Biology 17 (70). doi:10.1101/035949.
Vallejos, Catalina A, Davide Risso, Antonio Scialdone, Sandrine Dudoit, and John C Marioni. 2017. “Normalizing single-cell RNA sequencing data: challenges and opportunities.” Nature Methods 14 (6). Nature Publishing Group: 565–71. doi:10.1038/nmeth.4292.
Vallejos, Catalina A., John C. Marioni, and Sylvia Richardson. 2015a. “BASiCS: Bayesian analysis of single-cell sequencing data.” PLOS Computational Biology 11 (6): e1004333. doi:10.1371/journal.pcbi.1004333.
———. 2015b. “BASiCS: Bayesian analysis of single-cell sequencing data.” PLOS Computational Biology 11: e1004333. doi:10.1371/journal.pcbi.1004333.
Villani, Alexandra Chloé, Rahul Satija, Gary Reynolds, Siranush Sarkizova, Karthik Shekhar, James Fletcher, Morgane Griesbeck, et al. 2017. “Single-cell RNA-seq reveals new types of human blood dendritic cells, monocytes, and progenitors.” Science 356 (6335). doi:10.1126/science.aah4573.
Wagner, Daniel E., Caleb Weinreb, Zach M. Collins, James A. Briggs, Sean G. Megason, and Allon M. Klein. 2018. “Single-cell mapping of gene expression landscapes and lineage in the zebrafish embryo.” Science 4362: 1–12. doi:10.1126/science.aar4362.
Young, Matthew D, Matthew J Wakefield, Gordon K Smyth, and Alicia Oshlack. 2010. “Gene ontology analysis for RNA-seq: accounting for selection bias.” Genome Biology 11. http://genomebiology.com/2010/11/2/R14.
Zheng, Grace X.Y., Jessica M. Terry, Phillip Belgrader, Paul Ryvkin, Zachary W. Bent, Ryan Wilson, Solongo B. Ziraldo, et al. 2017. “Massively parallel digital transcriptional profiling of single cells.” Nature Communications 8. Nature Publishing Group: 1–12. doi:10.1038/ncomms14049.
Zopf, Cristopher J., Katie Quinn, Joshua Zeidman, and Narendra Maheshri. 2013. “Cell-Cycle Dependence of Transcription Dominates Noise in Gene Expression.” PLoS Computational Biology 9 (7): 1–12. doi:10.1371/journal.pcbi.1003161.